Notes for Math 221, Lecture 3, Sep 2 2010
Goals: Matrix and vector norms
Singular Value Decomposition
Condition numbers for Ax=b
Matrix and vector norms
Def: norm. Let B be the linear space R^n (or C^n).
It is normed if there is a function ||.||:B -> R s.t.
(1) ||x|| >= 0, and ||x|| = 0 iff x=0 (positive definiteness)
(2) ||c*x|| = |c|*||x|| (homogeneity)
(3) ||x+y|| <= ||x|| + ||y|| (triangle inequality)
Ex: p-norm = (sum_i |x_i|^p)^(1/p), for p >= 1
Euclidean norm = 2-norm
Note: x real => ||x||_2^2 = sum_i x_i^2 = x^T * x
infinity-norm = max_i |x_i|
C-norm: ||x||_C = ||C*x|| where C is nonsingular and ||.|| is any vector norm
Lemma (1.4): all norms are equivalent
    i.e. given any two norms norm_a(x) and norm_b(x), there are positive constants
    alpha and beta such that
       alpha*norm_a(x) <= norm_b(x) <= beta*norm_a(x) for all x
(proof: compactness)
This lemma is an excuse to use the easiest norm to prove later results.
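   For a concrete instance, here is a minimal matlab sketch (sizes and names arbitrary)
   checking the standard equivalence constants relating the 1-, 2- and infinity-norms:
      % chain: ||x||_inf <= ||x||_2 <= ||x||_1 <= sqrt(n)*||x||_2 <= n*||x||_inf
      n = 5; x = randn(n,1);
      [norm(x,inf), norm(x,2), norm(x,1), sqrt(n)*norm(x,2), n*norm(x,inf)]  % nondecreasing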
Def: matrix norm (a vector norm on m x n matrices viewed as vectors)
(1) ||A|| >= 0, and ||A|| = 0 iff A=0 (positive definiteness)
(2) ||c*A|| = |c|*||A|| (homogeneity)
(3) ||A+B|| <= ||A|| + ||B|| (triangle inequality)
Ex: max norm = max_ij |A_ij|
Frobenius norm = (sum_ij |A_ij|^2)^(1/2)
Def: operator norm
norm(A) = max_{nonzero x} norm(A*x)/norm(x)
Lemma (1.6): operator norm is a matrix norm
proof: homework Q 1.15
Lemma (1.7): if norm(A) operator norm, there exists x such that
norm(x)=1 and norm(A*x)=norm(A).
proof: norm(A) = max_{nonzero x} norm(A*x)/norm(x)
= max_{nonzero x} norm(A * x/norm(x) )
= max_{unit vector y} norm(A * y )
y attaining this maximum exists since norm(A*y) is a continuous
function of y on compact (closed and bounded) set = unit ball
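   A minimal matlab sketch of this (A, p and the sample count are arbitrary choices):
      % the operator norm dominates norm(A*y,p) for every unit vector y,
      % and is nearly attained by sampling many unit vectors
      A = randn(4,3); p = 2;                 % also try p = 1 or inf
      best = 0;
      for k = 1:1000
        y = randn(3,1); y = y/norm(y,p);     % random unit vector in the p-norm
        best = max(best, norm(A*y,p));
      end
      [best, norm(A,p)]                      % best <= norm(A,p)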
Notation: Q^* = conj(Q^T), sometimes Q^H
Def: orthogonal matrix: Q square, real and inv(Q) = Q^T
unitary matrices: Q square, complex, and inv(Q) = Q^*
(for simplicity, state results for real case, but all extend to complex).
Fact: Q orthogonal <=> Q^T*Q = I <=> the columns of Q are mutually orthogonal
      unit vectors (orthonormal); similarly Q*Q^T = I says the rows are orthonormal
Fact: norm(Q*x,2)=norm(x,2) - aka Pythagorean theorem
Proof: norm(Q*x,2)^2 = (Q*x)^T*(Q*x) = x^T*Q^T*Q*x = x^T*x = norm(x,2)^2
Fact: Q and Z orthogonal => Q*Z orthogonal
Proof: (Q*Z)^T * (Q*Z) = Z^T*Q^T * Q*Z = Z^T * I * Z = I
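   A minimal matlab sketch of these facts (using qr of a random matrix as a convenient
   source of an orthogonal Q; the sizes are arbitrary):
      [Q,R] = qr(randn(5)); [Z,R2] = qr(randn(5));
      x = randn(5,1);
      norm(Q'*Q - eye(5))              % ~ 1e-15: columns are orthonormal
      norm(Q*x,2) - norm(x,2)          % ~ 1e-15: 2-norm is preserved
      norm((Q*Z)'*(Q*Z) - eye(5))      % ~ 1e-15: Q*Z is again orthogonal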
Fact: If Q is m x n with n < m and Q^T*Q = I (orthonormal columns), then
      norm(Q*x,2) = norm(x,2) as well (same proof)
Fact: norm(A,2) = sqrt(l_max), where l_max is the largest eigenvalue of A^T*A
Proof: norm(A,2) = max_{nonzero x} norm(A*x,2)/norm(x,2)
                 = sqrt ( max_{nonzero x} x^T*A^T*A*x / x^T*x )
   A^T*A is symmetric positive semidefinite, so it has eigenvalues l_1,...,l_n >= 0
   and orthonormal eigenvectors q_1,...,q_n, i.e.
   A^T*A*Q = Q*Lambda where Q = [q_1,...,q_n], Lambda = diag(l_1,...,l_n)
so A^T*A = Q*Lambda*Q^T and continuing from above
norm(A,2) = sqrt ( max_{nonzero x} x^T*Q*Lambda*Q^T*x / x^T*x )
= sqrt ( max_{nonzero x} x^T*Q*Lambda*Q^T*x / x^T*Q*Q^T*x )
= sqrt ( max_{nonzero y} y^T*Lambda*y / y^T*y )
where y = Q^T*x
= sqrt ( max_{nonzero y} sum_i y_i^2 l_i / sum_i y_i^2 )
<= sqrt ( max_{nonzero y} l_max * sum_i y_i^2 / sum_i y_i^2 )
       = sqrt(l_max), attained by choosing y with y_i = 1 at an index where l_i = l_max, rest 0
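   A minimal matlab sketch of this fact (A arbitrary):
      A = randn(6,4);
      [norm(A,2), sqrt(max(eig(A'*A))), max(svd(A))]   % the three should agree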
SVD = Singular Value Decomposition
History: first complete statement goes back to Eckart&Young in 1936,
First reliable algorithm by Golub & Kahan in 1965, faster ones since then,
to be discussed in Chapter 5; perhaps best algorithm appears in
2010 PhD thesis by Paul Willems, still to be incorporated into LAPACK
(learning about, testing this algorithm is a possible class project)
Thm: Suppose A is m x m. Then there exist an
orthogonal matrix U = [u(1),...,u(m)]
diagonal matrix Sigma = diag(sigma(1),...,sigma(m))
with sigma(1) >= sigma(2) >= ... >= 0
     orthogonal matrix V = [v(1),...,v(m)]
with A = U*Sigma*V^T.
u(i) called left singular vectors
sigma(i) called singular values
v(i) called right singular vectors
More generally, if A = m x n with m > n, then
U m x m and orthogonal as before
V n x n and orthogonal
Sigma is m x n with same diagonal as before
   When m > n, we sometimes write this more compactly as the thin SVD
      A = [u(1),...,u(n)] * diag(sigma(1),...,sigma(n)) * V^T, i.e. with U just m x n
Geometric interpretation: Thinking of A as a linear mapping from R^n
to R^m, with the right orthogonal choice of bases
of R^n (i.e. columns of V) and R^m (i.e. columns of U)
then A is diagonal (i.e. Sigma):
A = U*Sigma*V^T => A*V = U*Sigma => A*v(i) = sigma(i)*u(i)
Proof: Induction on n:
Two base cases
n=1: Let U have the first column = A/norm(A,2), rest chosen
any way that makes U orthogonal; Sigma = norm(A,2), V = 1
A=0: Let U = I_m, Sigma = 0, V = I_n
Induction step (if A nonzero):
||A||_2 = max_x ||A*x||_2 / ||x||_2
= max_{x: ||x||_2 = 1} ||A*x||_2
Let v(1) be x attaining the max, sigma(1) = ||A||_2,
and u(1) = A*v(1) / ||A*v(1)||_2 .
Choose V = [v(1),Vhat] to be square & orthogonal
Choose U = [u(1),Uhat] to be square & orthogonal
Write Ahat = U^T*A*V
= [ u(1)^T*A*v(1) u(1)^T*A*Vhat ]
[ Uhat'*A*v(1) Uhat'*A*Vhat ]
= [ sigma(1) A12 ]
[ A21 A22 ]
     Show A21 = 0 by def of Uhat: A21 = Uhat^T*A*v(1) = ||A*v(1)||_2 * Uhat^T*u(1) = 0
       since the columns of Uhat are orthogonal to u(1)
     Show A12 = 0 by def of sigma(1) = ||A||_2: note ||Ahat||_2 = ||A||_2 (U,V orthogonal);
       if A12 were nonzero, then with z = [sigma(1); A12^T] the first entry of Ahat*z
       is sigma(1)^2 + ||A12||_2^2 = ||z||_2^2, so
       ||Ahat||_2 >= ||Ahat*z||_2/||z||_2 >= ||z||_2 > sigma(1), a contradiction
Apply induction to A22 = U_2*Sigma_2*V_2^T
so A = U*Ahat*V^T = U * [ sigma(1) 0 ] * V^T
[ 0 U_2*Sigma_2*V_2^T ]
= U * [ 1 0 ] * [ sigma(1) 0 ] * [ 1 0 ] * V^T
[ 0 U_2 ] [ 0 Sigma_2 ] [ 0 V_2^T ]
= orthogonal * nonnegative_diagonal * orthogonal, as desired
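   A minimal matlab sketch of the relation A*v(i) = sigma(i)*u(i) (sizes arbitrary):
      A = randn(5,3);
      [U,S,V] = svd(A);
      norm(A*V - U*S)                   % ~ 1e-15: A is diagonal in the bases V and U
      norm(A*V(:,1) - S(1,1)*U(:,1))    % first singular triple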
Thm: The SVD has many useful properties (assume A m x n with m >= n)
Fact 1: In the square nonsingular case,
we can use it to solve A*x=b with O(n^2) more work:
     Proof: x = inv(A)*b = inv(U*Sigma*V^T)*b = (V*inv(Sigma)*U^T)*b
But note: computing the SVD itself is expensive, O(n^3) (see Chap 5)
If all you want to do is solve A*x=b, Gaussian Elimination is cheaper.
On the other hand, we will see that the SVD is more reliable when
A is nearly singular, and even lets us "solve" A*x=b when A
is exactly singular.
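   A minimal matlab sketch of Fact 1 (sizes arbitrary; the comparison with A\b is just a check):
      n = 6; A = randn(n); b = randn(n,1);
      [U,S,V] = svd(A);                 % O(n^3), done once
      x = V*((U'*b)./diag(S));          % x = V*inv(Sigma)*U^T*b, without forming inv(Sigma)
      norm(A*x - b)/norm(b)             % tiny residual; compare with A\b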
Fact 2: When m>n, we can use it to solve the full rank least squares problem
     argmin_x ||A*x-b||_2 as follows: writing the thin SVD A = U*Sigma*V^T with U m x n,
     then x = V*inv(Sigma)*U^T*b, the same formula as in the square case
Proof: Write A = Uhat*Sigmahat*V^T where Uhat = [U,U'] is m x m and
Sigmahat = [Sigma; 0] is m x n. Then
|| A*x-b ||_2^2 = || Uhat*Sigmahat*V^T*x - b ||_2^2
          = || Uhat^T*(Uhat*Sigmahat*V^T*x - b) ||_2^2   ... multiplying by Uhat^T preserves the 2-norm
= || Sigmahat*V^T*x - Uhat^T*b ||_2^2
= || [ Sigma*V^T*x - U^T*b ] ||^2
|| [ - U'^T*b] ||_2
= || Sigma*V^T*x - U^T*b ||_2^2
+|| -U'^T*b ||_2^2
is clearly minimized by choosing x = V*inv(Sigma)*U^T*b
to zero out the term depending on x
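   A minimal matlab sketch of Fact 2 (sizes arbitrary; A\b is used only to check the answer):
      m = 8; n = 3; A = randn(m,n); b = randn(m,1);
      [U,S,V] = svd(A,0);               % thin ("economy size") SVD: U is m x n
      x = V*((U'*b)./diag(S));          % x = V*inv(Sigma)*U^T*b
      norm(x - A\b)                     % matches Matlab's least squares solution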
Def: When A = U*Sigma*V^T is m x n, m >= n, and full rank, then
A^+ = V*inv(Sigma)*U^T is n x m, and is called the
     Moore-Penrose pseudoinverse of A.
This is the most natural extension of the definition of "inverse"
to rectangular full-rank matrices.
We can also use the SVD, and an appropriately defined Moore-Penrose
pseudoinverse to solve the rank deficient least squares problems,
or underdetermined problem (m < n), as we describe later.
Just to solve a least squares problem where you are not
worried about rank deficiency, the QR decomposition is cheaper.
On the other hand, we will see that the SVD is more reliable when
A is nearly singular.
Fact 3: If A symmetric with eigenvalues Lambda = diag(lambda_1,...,lambda_n)
and orthonormal eigenvectors V = [v(1),...,v(n)], then its SVD is
A = V*Lambda*V^T (done if all lambda_i >= 0)
          = (V*D)*(D*Lambda)*V^T where D = diag(sign(lambda(i))) (take sign(0) = 1)
          = U*Sigma*V^T, with U = V*D orthogonal and Sigma = D*Lambda = diag(|lambda(i)|) >= 0
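   A minimal matlab check of Fact 3 (assuming no zero eigenvalues, so D is +-1 on the diagonal):
      B = randn(4); A = B + B';               % symmetric, generally indefinite
      [V,Lambda] = eig(A);
      D = diag(sign(diag(Lambda)));
      U = V*D; Sigma = D*Lambda;              % Sigma = diag(|lambda_i|) >= 0
      norm(U*Sigma*V' - A)                    % ~ 1e-15
      norm(sort(diag(Sigma),'descend') - svd(A))   % singular values = |eigenvalues|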
Fact 4: The eigenvalue decomposition of
A^T*A = (U*Sigma*V^T)^T*(U*Sigma*V^T) = V*Sigma^2*V^T
(what happens if m>n?)
Fact 5: The eigenvalue decomposition of
A*A^T = (U*Sigma*V^T)*(U*Sigma*V^T)^T = U*Sigma^2*U^T
(what happens if m>n?)
Fact 6: Let H = [ 0 A^T ] be (m+n) x (m+n), where we assume A is m x n
[ A 0 ]
Then H has eigenvalues +- sigma(i)
and eigenvectors 1/sqrt(2)*[v(i) ; +- u(i)]
Proof: plug in A = U*Sigma*V^T
Fact 6 suggests that algorithms for the SVD and the symmetric eigenproblem
will be closely related (see Chap 5)
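   A minimal matlab sketch of Fact 6, for a square A to keep the dimensions simple:
      n = 4; A = randn(n);
      H = [zeros(n) A'; A zeros(n)];
      norm(sort(eig(H)) - sort([svd(A); -svd(A)]))   % ~ 1e-15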
Fact 7: ||A||_2 = sigma(1) and, if A is square and nonsingular, ||inv(A)||_2 = 1/sigma(n)
Def: kappa(A) = sigma(1)/sigma(n) is called the condition number of A
for reasons we will see shortly
Fact 8: Let S be the unit sphere in R^n. Then A*S is an ellipsoid centered
at the origin with principal axes sigma(i)*u(i)
   Proof: write A*S = U*Sigma*V^T*S = U*Sigma*S (V^T maps the unit sphere to itself)
          = { sum_i sigma(i)*w(i)*u(i) : sum_i w(i)^2 = 1 }, an ellipsoid with the stated axes
* (matlab demo a = randn(2,2), svddemo2; a = randn(3,3); svddemo3)
Fact 9: Suppose sigma(1) >= ... >= sigma(r) > 0 = sigma(r+1) = ... = sigma(n).
Then A has rank r; the null-space of A is span(v(r+1),...,v(n)),
of dimension n-r, and the range space of A is span(u(1),...,u(r)),
of dimension r
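   A minimal matlab sketch of Fact 9 (the rank-2 product and the tolerance are arbitrary):
      A = randn(5,2)*randn(2,4);                 % rank 2 with probability 1
      [U,S,V] = svd(A); tol = 1e-12;
      r = sum(diag(S) > tol)                     % numerical rank (here 2)
      norm(A*V(:,r+1:end))                       % ~ 1e-15: v(r+1),...,v(n) span the null space
      norm(A - U(:,1:r)*S(1:r,1:r)*V(:,1:r)')    % ~ 1e-15: u(1),...,u(r) span the range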
Fact 10: Matrix A_k of rank k closest to A in 2-norm is
A_k = sum_{i=1 to k} u_i*sigma(i)*v(i)^T = U*Sigma_k*V^T
where Sigma_k = diag (sigma(1) , ... , sigma(k), 0, ... 0)
and the distance is norm(A - A_k , 2) = sigma(k+1)
Proof: easy to see that A_k has right rank, right distance to A;
need to show no closer one:
Suppose B has rank k, so null space has dimension n-k.
The space spanned by {v(1),...,v(k+1)} has dimension k+1.
Since the sum of the dimensions (n-k)+(k+1) > n, these two spaces
must overlap (can't have > n independent vectors in R^n).
Let h be unit vector in their intersection. then
norm(A-B,2) >= norm((A-B)*h,2) = norm(A*h,2)
= norm(U*Sigma*V^T*h,2) = norm(Sigma*V^T*h,2)
       = norm(Sigma*x,2) where x = V^T*h is a unit vector whose nonzeros lie
         in its first k+1 entries, since h is in span(v(1),...,v(k+1))
       >= sigma(k+1)
(matlab demo: load clown.mat, [U,S,V]=svd(X); colormap('gray')
figure(1), clf, image(X),
figure(2), clf, for k=[1:10,20:10:200], Xk=U(:,1:k)*S(1:k,1:k)*(V(:,1:k))'; ...
err = norm(X-Xk)/norm(X); compr = k*(200+320+1)/(200*320);
image(Xk), colormap('gray'), ...
title(['k= ',int2str(k),' err= ', num2str(err),' compression= ',num2str(compr)]),
 pause, end )
(Note: jpeg compression algorithm uses similar idea, on subimages)
Now start using this material to analyze the condition number
for matrix inversion and solving Ax=b: If A (and b) change
a little bit, how much can inv(A) (and x=inv(A)*b) change?
If |x|<1, recall that 1/(1-x) = 1 + x + x^2 + x^3 + ...
Now generalize to matrices:
Lemma: If operator norm(X)<1, then I-X is nonsingular and
inv(I - X) = sum_{i>=0} X^i and
norm(inv(I-X)) <= 1/(1-norm(X))
proof: claim I + X + X^2 + ... converges:
norm(X^i) <= norm(X)^i -> 0 as i increases
so by equivalence of norms max_{kj} | (X^i)_kj | <= C* norm(X)^i
and so kj-th entry of I + X + X^2 ... dominated by
convergent geometric sequence and so converges
claim (I - X)*( I + X + X^2 + ... + X^i)
= I - X^(i+1) -> I as i increases
next norm(I + X + X^2 + ...)
<= norm(I) + norm(X) + norm(X^2) + ... by triangle inequality
<= norm(I) + norm(X) + norm(X)^2 + ... by ||A*B||<=||A||*||B||
= 1 + norm(X) + norm(X)^2 + ... by def of operator norm
= 1/(1 - norm(X)) ... geometric sum
(Later: generalize to arbitrary Taylor expansions, like e^X = sum_i X^i/i!)
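   A minimal matlab sketch of the lemma (norm(X) scaled to 1/2, and enough terms that
   the truncation error of the series is negligible):
      n = 5; X = randn(n); X = 0.5*X/norm(X);    % ensure norm(X) = 0.5 < 1
      S = eye(n); T = eye(n);
      for i = 1:60, T = T*X; S = S + T; end      % S ~ I + X + X^2 + ... + X^60
      norm(S - inv(eye(n) - X))                  % ~ 1e-15
      [norm(inv(eye(n) - X)), 1/(1 - norm(X))]   % the bound holds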
Lemma: Suppose A invertible. Then A-E invertible if norm(E) < 1/norm(inv(A))
in which case
inv(A-E) = Z + Z(EZ) + Z(EZ)^2 + Z(EZ)^3 + ... where Z=inv(A)
and norm(inv(A-E)) <= norm(Z)/(1-norm(E)*norm(Z))
proof: inv(A-E) = inv((I-E*Z)*A) = Z*inv(I-E*Z)
       exists if I-E*Z is invertible, which holds if
       norm(E*Z) < 1, which in turn holds if
       norm(E)*norm(Z) < 1 (since norm(E*Z) <= norm(E)*norm(Z)), in which case
       inv(A-E) = Z*(I + E*Z + (E*Z)^2 + ...)
take norms
What does this say for 1x1 matrices?
Why can't we write inv(A-E) = Z + EZ^2 + E^2Z^3 + ... ?
Finally, we can ask how much inv(A) and inv(A+E) differ.
Lemma: Suppose A invertible. If norm(E) < 1/norm(Z), then
norm(inv(A-E)-Z) <= norm(Z)^2*norm(E)/(1-norm(E)*norm(Z))
proof: inv(A-E)-Z = Z(EZ) + Z(EZ)^2 + ...
= ZEZ ( I + EZ + (EZ)^2 + ...)
and take norms
relative change in inv(A) is
norm(inv(A-E)-Z)/norm(Z)
<= [ norm(A)*norm(Z) * 1/(1 - norm(E)*norm(Z)) ] * [ norm(E)/norm(A) ]
= [ norm(A)*norm(Z) ] * [ norm(E)/norm(A) ] + O(norm(E)^2)
= condition_number * relative_change_in_A
What does this say for 1x1 matrices?
This justifies the following definition:
DEF condition number kappa(A) = norm(A)*norm(Z)
Fact: kappa(A) >= 1.
proof: 1 = norm(I) = norm(A*Z) <= norm(A)*norm(Z)
Theorem: min{norm(E)/norm(A): A+E singular}
= relative_distance(A, {singular matrices})
= 1/kappa(A)
proof for 2-norm, using SVD: min{norm(E): A+E singular} = sigma_min(A),
so relative_distance(A,{singular}) = sigma_min(A)/sigma_max(A)
And norm(Z) = norm(inv(A)) = norm(V*inv(Sigma)*U^T) = norm(inv(Sigma))
= 1/sigma_min(A)
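   A minimal matlab sketch of this theorem (the size of A is arbitrary):
      n = 5; A = randn(n);
      [U,S,V] = svd(A);
      E = -S(n,n)*U(:,n)*V(:,n)';          % norm(E,2) = sigma_min(A)
      min(svd(A+E))                        % ~ 1e-16: A+E is (numerically) singular
      [norm(E)/norm(A), 1/cond(A)]         % relative distance = 1/kappa(A)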
We've looked at sensitivity of inv(A), now look at solving A*x=b
Now consider A*x = b vs (A-E)*x' = b+f where x' = x+dx
subtract to get
A*dx - E*x - E*dx = f
(A-E)dx = f + E*x
dx = inv(A-E)*(f + E*x)
norm(dx) = norm(inv(A-E)*(f + E*x))
<= norm(inv(A-E))*(norm(f) + norm(E)*norm(x))
<= norm(Z)/(1-norm(E)*norm(Z))*(norm(f) + norm(E)*norm(x))
   norm(dx)/norm(x)
      <= norm(Z)*norm(A) * 1/(1-norm(E)*norm(Z))
         * ( norm(f)/(norm(A)*norm(x)) + norm(E)/norm(A) )
      <= norm(Z)*norm(A) * 1/(1-norm(E)*norm(Z))
         * ( norm(f)/norm(b) + norm(E)/norm(A) )     ... since norm(b) <= norm(A)*norm(x)
relerr in x <= kappa(A)
* something close to 1 unless A nearly singular
* (rel change in b + rel change in A)
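   A minimal matlab sketch of this bound (perturbation sizes arbitrary; the printed bound
   drops the 1/(1-norm(E)*norm(Z)) factor, which is essentially 1 here):
      n = 6; A = randn(n); b = randn(n,1); x = A\b;
      E = 1e-8*randn(n); f = 1e-8*randn(n,1);
      xp = (A - E)\(b + f);                % perturbed solution x' = x + dx
      relerr = norm(xp - x)/norm(x)
      bound = cond(A)*(norm(f)/norm(b) + norm(E)/norm(A))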
Our algorithms will attempt to guarantee that
(*) computed solution of A*x=b is (A-E)*x' = b+f where
norm(f)/norm(b) = O(macheps) and norm(E)/norm(A) = O(macheps)
so relerr in x ~ O( kappa(A)*macheps )
Recall that property (*) is called "backward stability"
Another practical approach: given x', how accurate a solution is it?
compute residual r = A*x'-b = A*x'-A*x = A*(x'-x) = A*error
so error = inv(A)*r and norm(error) <= norm(inv(A))*norm(r)
norm(r) also determines the backward error in A
Theorem: The smallest E such that (A+E)x' = b has norm(E) = norm(r)/norm(x')
proof: r = A*x' - b = -E*x' so norm(r) <= norm(E)*norm(x')
to attain lower bound on norm(E), choose E = -r*x'^T/norm(x')^2, in 2 norm
In other words, if Ax'-b is small, then the backwards error is small,
which is probably the most you should ask for when the entries of A are uncertain.
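   A minimal matlab sketch of this theorem (A, b arbitrary; xp is the computed solution):
      n = 5; A = randn(n); b = randn(n,1);
      xp = A\b;                            % computed, slightly inexact, solution
      r = A*xp - b;
      E = -r*xp'/norm(xp)^2;               % rank-one choice attaining the bound (2-norm)
      norm((A + E)*xp - b)                 % ~ 1e-16: xp solves the perturbed system exactly
      [norm(E), norm(r)/norm(xp)]          % the two norms agree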
(Later we will see how to use "iterative refinement", aka Newton's method,
to get a tiny error in x as long as kappa(A) is not about 1/macheps or larger,
but this is only sometimes justified.)
All our practical error bounds depend on norm(inv(A)). To actually compute inv(A)
costs several times as much as just solving A*x=b (2n^3 versus (2/3)n^3),
so we will use cheap estimates of norm(inv(A)) that avoid computing
inv(A) explicitly, and work with small probability of large error.
The idea is to use the definition
|| inv(A) || = max_{||x|| = 1} || inv(A)*x || = max_{||x||<= 1} || inv(A)*x ||
and do gradient ascent ("go uphill") on || inv(A)*x ||
on the convex set ||x|| <= 1; one may also start with a random starting vector.
For right choice of norm it is easy to figure out ascent direction,
and each step requires solving Ax=b for some b, which only costs O(n^2),
(assuming we have already done LU factorization).
In practice it usually takes at most 5 steps or so to stop ascending,
so the total cost is O(n^2); see the text for details.
In fact there is a theorem (Demmel, Diament, Malajovich, 2000) that says that
estimating kappa(A) even roughly, but still with some guarantee (say "to within
a factor of 10", or within any constant factor) is about as expensive as
multiplying matrices, which in turn is about as expensive as doing Gaussian
elimination in the first place.
Since our goal is to spend just an additional O(n^2) to estimate norm(inv(A)),
given that we have already done Gaussian Elimination (LU factorization),
this theorem implies that we need to settle for a small probability of
getting a large error in our estimate of norm(inv(A)).
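   For instance, Matlab's condest returns such a cheap estimate of the 1-norm condition
   number; a quick (arbitrary-example) comparison with the exact value:
      n = 300; A = randn(n);
      [condest(A), cond(A,1)]              % estimate is typically within a small factor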
Where to find implementations of all this?
Matlab: A\b or [P,L,U]=lu(A) or condest or normest1
LAPACK: xGETRF just for GEPP where x = S/D/C/Z
xGESV to solve A*x=b
xGESVX for condition estimation, more
xGECON for condition estimation alone
CLAPACK: similar
ScaLAPACK: PxGESV, etc