Notes for Ma221 Lecture 12, Oct 6 2009

Goals for today: Sections 3.1-3.3

Example: polynomial fitting
   Formulate as minimizing norm(A*x-b), where A(i,j) = y(i)^(j-1) and the (y(i),b(i)) are the sample points.
   Run polyfit31 to show results.

Standard notation: argmin_x norm(A*x-b), A m x n, m > n

Other variants (code in LAPACK):
   constrained: argmin_x norm(A*x-b) s.t. B*x = y
      need #rows(B) <= #cols(A) <= #rows(A) + #rows(B) so the answer is unique
   weighted: argmin_x norm(y) s.t. d = A*x + B*y
      called weighted because if B = I it is the usual problem;
      if B is square and nonsingular it equals argmin_x norm(inv(B)*(d-A*x))
   underdetermined: if #rows(A) < #cols(A), then there is a whole set of solutions
      (add any z with A*z = 0 to x), so to make the answer unique we seek
      argmin_x norm(x) s.t. A*x = b

Algorithms:
   Normal equations: solve A^T*A*x = A^T*b
      A^T*A is s.p.d., so solve with Cholesky;
      fastest (in the dense case), but not stable if A is ill-conditioned
   QR decomposition A = Q*R
      Gram-Schmidt - unstable
      Householder - stable
      blocked Householder - to minimize data movement
   SVD - most "complete" answer: shows exactly how small changes in A and b
      could perturb the solution x; best when A is ill-conditioned
   Conversion to a linear system - allows iterative refinement (Question 3.3)

Normal equations:
   Derive by setting the gradient of norm(A*x-b,2)^2 to zero.
   Why a minimizer? Complete the square in norm(A*(x+e)-b,2)^2
      (picture; use of the Pythagorean Theorem).
   #flops: m*n^2 (forming A^T*A) + n^3/3 (Cholesky)

Gram-Schmidt A = Q*R (dimensions: Q is m x n with orthonormal columns, R is n x n upper triangular)
   Assuming we can compute A = Q*R, how do we solve argmin_x norm(A*x-b)?
      x = R\(Q^T*b)
   Why?
      Proof 1: A = Q*R = [Q,Qhat]*[R;0] where [Q,Qhat] is square and orthogonal ...
      Proof 2: A*x - b = Q*R*x - b = Q*R*x - Q*Q^T*b + Q*Q^T*b - b
                       = Q*(R*x - Q^T*b) + (Q*Q^T - I)*b, and these two terms are orthogonal ...
      Proof 3: plug into the normal equations.

   Algorithms: Classical and Modified Gram-Schmidt
      for i = 1:n
         Q(:,i) = A(:,i)
         for j = 1:i-1
            r(j,i) = Q(:,j)'*A(:,i)          ... CGS, costs 2m
            r(j,i) = Q(:,j)'*Q(:,i)          ... MGS, costs 2m
            Q(:,i) = Q(:,i) - r(j,i)*Q(:,j)  ... costs 2m
         end
         r(i,i) = norm(Q(:,i),2)             ... costs 2m
         Q(:,i) = Q(:,i)/r(i,i)              ... costs m
      end
   Total #flops = 4*m*n^2/2 + O(mn) = 2*m*n^2 + O(mn), about 2x the normal equations.

   Plots showing instability: run QRStability with cnd = 1e1, 1e2, 1e4, 1e8, size 6 x 4, samples = 100.
   Eventually: will explain a stable implementation; also needed for eigenvalue problems.

SVD = Singular Value Decomposition
   (History: the first complete version goes back to Eckart & Young in 1936;
    the first algorithm is due to Golub & Kahan in 1965, with faster ones since then,
    to be discussed in Chapter 5.)

Thm. Suppose A is m x n with m >= n. Then there are
      U = [u(1),...,u(n)], m x n with orthonormal columns,
      Sigma = diag(sigma(1),...,sigma(n)), n x n diagonal with sigma(1) >= sigma(2) >= ... >= sigma(n) >= 0,
      V = [v(1),...,v(n)], n x n orthogonal,
   such that A = U*Sigma*V^T.
   The u(i) are called left singular vectors, the sigma(i) singular values,
   and the v(i) right singular vectors.

Geometric interpretation: thinking of A as a linear map from R^n to R^m, with the right
orthogonal choices of bases of R^n (i.e. V) and R^m (i.e. U), A is diagonal (i.e. Sigma).
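Before the proof, a quick numerical illustration of the theorem (my own sketch, not one of
the class demos; the matrix and its size are arbitrary):

   A = randn(6,4);
   [U,S,V] = svd(A,0);          % "thin" SVD: U is 6 x 4, S and V are 4 x 4
   norm(U'*U - eye(4))          % U has orthonormal columns (~ macheps)
   norm(V'*V - eye(4))          % V is orthogonal
   diag(S)'                     % singular values, sorted in decreasing order
   norm(U'*A*V - S)             % in the bases V and U, A acts diagonally
   norm(A - U*S*V')             % A = U*Sigma*V^T up to roundoff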
Proof: by induction. Assume A is nonzero (else the result is easy).
   ||A||_2 = max_x ||A*x||_2 / ||x||_2 = max_{x: ||x||_2 = 1} ||A*x||_2.
   Let v(1) be an x attaining the max, sigma(1) = ||A||_2, and u(1) = A*v(1) / ||A*v(1)||_2.
   Choose V = [v(1),Vhat] to be square and orthogonal.
   Choose U = [u(1),Uhat] to be square and orthogonal.
   Write
      Ahat = U^T*A*V = [ u(1)^T*A*v(1)   u(1)^T*A*Vhat ]  =  [ sigma(1)   A12 ]
                       [ Uhat^T*A*v(1)   Uhat^T*A*Vhat ]     [ A21        A22 ]
   Show A21 = 0 by the definition of Uhat.
   Show A12 = 0 by the definition of sigma(1) = ||A||_2.
   Apply induction to A22 = U_2*Sigma_2*V_2^T.

Thm: The SVD has many useful properties (assume A is m x n with m >= n).

Fact 1: If A is symmetric with eigenvalues lambda(i) and orthonormal eigenvectors v(i),
   then its SVD is
      A = V*Lambda*V^T   (done if all lambda(i) >= 0)
        = (V*D)*(D*Lambda)*V^T   where D = diag(sign(lambda(i)))
        = U*Sigma*V^T

Fact 2: The eigenvalue decomposition of A^T*A = (U*Sigma*V^T)^T*(U*Sigma*V^T) = V*Sigma^2*V^T.
   (What happens if m > n?)

Fact 3: The eigenvalue decomposition of A*A^T = (U*Sigma*V^T)*(U*Sigma*V^T)^T = U*Sigma^2*U^T.
   (What happens if m > n?)

Fact 4: Let H = [ 0   A^T ]   which is (m+n) x (m+n).
                [ A   0   ]
   Then H has eigenvalues +-sigma(i) with eigenvectors (1/sqrt(2))*[v(i) ; +-u(i)]
   (plus m-n zero eigenvalues when m > n). (Plug in A = U*Sigma*V^T.)

Fact 5: If A has full rank, then argmin_x norm(A*x-b,2) is x = V*inv(Sigma)*U^T*b.
   (Plug in A = U*Sigma*V^T.)

Fact 6: ||A||_2 = sigma(1); if A is square and nonsingular, ||inv(A)||_2 = 1/sigma(n)
   and ||A||_2 * ||inv(A)||_2 = sigma(1)/sigma(n).

Fact 7: Suppose sigma(1) >= ... >= sigma(r) > 0 = sigma(r+1) = ... = sigma(n).
   Then A has rank r; the null space of A is span(v(r+1),...,v(n)), of dimension n-r,
   and the range space of A is span(u(1),...,u(r)), of dimension r.

Fact 8: Let S be the unit sphere in R^n. Then A*S is an ellipsoid centered at the origin
   with principal axes sigma(i)*u(i).
   (Write A*S = U*Sigma*V^T*S = U*Sigma*S = { sum_i sigma(i)*w(i)*u(i) : sum_i w(i)^2 = 1 }.)
   (Matlab demos svddemo2, svddemo3.)

Fact 9: The matrix A_k of rank k closest to A in the 2-norm is
      A_k = sum_{i=1 to k} sigma(i)*u(i)*v(i)^T = U*Sigma_k*V^T
   where Sigma_k = diag(sigma(1),...,sigma(k),0,...,0), and the distance is
      norm(A - A_k, 2) = sigma(k+1).
   (It is easy to see that A_k has the right rank and the right distance to A; we need to
   show there is no closer one. Suppose B has rank k, so its null space has dimension n-k.
   The space spanned by {v(1),...,v(k+1)} has dimension k+1. Since the sum of the
   dimensions (n-k)+(k+1) > n, these two spaces must intersect (we can't have more than n
   independent vectors in R^n). Let h be a unit vector in their intersection. Then
      norm(A-B,2) >= norm((A-B)*h,2) = norm(A*h,2) = norm(U*Sigma*V^T*h,2)
                   = norm(Sigma*V^T*h,2) = norm(Sigma*[x(1),...,x(k+1),0,...,0]^T,2)
                  >= sigma(k+1),
   where [x(1),...,x(k+1),0,...,0]^T = V^T*h is a unit vector whose last n-k-1 entries
   are zero.)

   (Matlab demo:
      load clown.mat
      [U,S,V] = svd(X);
      figure(1), clf, image(X), colormap('gray')
      figure(2), clf
      for k = [1:10, 20:10:200]
         Xk = U(:,1:k)*S(1:k,1:k)*(V(:,1:k))';
         err = norm(X-Xk)/norm(X);
         image(Xk), colormap('gray')
         title(['k= ',int2str(k),' err= ',num2str(err)]), pause
      end )

Def: If A = U*Sigma*V^T has full rank, its (Moore-Penrose) pseudoinverse is
   A^+ = V*inv(Sigma)*U^T.
Facts: A^+ = inv(A^T*A)*A^T, and the solution of argmin_x norm(A*x-b) is x = A^+*b.
   In Matlab, A^+ = pinv(A). (When A is not full rank, or #cols > #rows, the definition
   comes later.)
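A small illustrative sketch (my own, not from the notes): for a well-conditioned full-rank
tall A, the SVD formula of Fact 5, the normal equations, pinv, and backslash all give the
same least squares solution up to roundoff:

   m = 8; n = 3;
   A = randn(m,n);  b = randn(m,1);
   [U,S,V] = svd(A,0);                       % thin SVD
   x1 = V*(S\(U'*b));                        % x = V*inv(Sigma)*U^T*b  (Fact 5)
   x2 = (A'*A)\(A'*b);                       % normal equations
   x3 = pinv(A)*b;                           % A^+ * b
   x4 = A\b;                                 % Householder QR inside backslash
   [norm(x1-x4), norm(x2-x4), norm(x3-x4)]   % all tiny for this well-conditioned A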
Perturbation theory for the least squares problem:
   If x = argmin_x norm(A*x-b,2) and we change A and b a little, how much can x change?
   Since the square case is a special case, we expect cond(A) to show up, but there is
   another source of sensitivity:
      A = [1;0], b = [0;b2]   =>  x = inv(A^T*A)*A^T*b = 1*0 = 0
      A = [1;0], b = [b1;b2]  =>  x = 1*b1 = b1
   If b2 is very large, then b1 can be small compared to b2 (a small relative change in b)
   and still be large, giving a big change in x.
   More generally, if b is (nearly) orthogonal to span(A), the condition number is very large.

   Expand e = (x+e) - x = inv((A+dA)^T*(A+dA))*(A+dA)^T*(b+db) - inv(A^T*A)*A^T*b,
   keep only the linear terms (one factor of dA and/or db), and let
      eps = max( norm(dA)/norm(A) , norm(db)/norm(b) )
   to get
      norm(e)/norm(x) <= eps*( 2*cond(A)/cos(theta) + tan(theta)*cond(A)^2 ) + O(eps^2)
   where theta = angle(b, A*x), i.e. sin(theta) = norm(A*x-b,2)/norm(b,2).
   So the condition number can be large if
      (1) cond(A) is large,
      (2) theta is near pi/2, i.e. 1/cos(theta) ~ tan(theta) is large,
      (3) the error grows like cond(A)^2 when theta is not near zero.
   It will turn out that the right QR decomposition keeps eps = O(machine epsilon), so it
   is backward stable. The normal equations are not backward stable: since we do
   Chol(A^T*A), the error bound is always proportional to cond(A)^2, even when theta is small.

Stable algorithms for QR decomposition:
   Recall that MGS and CGS produced Qs that were far from orthogonal, even though
   norm(A-Q*R) was as small as desired. For solving least squares problems (and later
   eigenproblems and computing the SVD) we need algorithms that keep Q very nearly
   orthogonal, i.e. norm(Q^T*Q - I) = O(macheps).
   Note: MGS still has its uses in some other algorithms; CGS does not.

   Basic idea: express Q as a product of simple orthogonal matrices Q = Q(1)*Q(2)*...,
   where each Q(i) accomplishes part of the task (multiplying by it makes some entries
   zero, until A turns into R). Since a product of orthogonal matrices is orthogonal,
   there is no danger of losing orthogonality. There are two kinds of simple orthogonal
   matrices: Householder transformations and Givens rotations.

   A Householder transformation (or reflection) is of the form H = I - 2*u*u^T, where u
   is a unit vector. We confirm orthogonality as follows:
      H*H^T = (I - 2*u*u^T)*(I - 2*u*u^T) = I - 4*u*u^T + 4*u*u^T*u*u^T = I.
   It is called a reflection because H*x is the reflection of x in the plane orthogonal
   to u (picture).

   Given x, we want to find u so that H*x has zeros in particular locations; in particular
   we want u so that H*x = [c,0,0,...,0]^T = c*e1 for some constant c. Since the 2-norm of
   both sides is ||x||_2, we have |c| = ||x||_2. We solve for u as follows: write
      H*x = (I - 2*u*u^T)*x = x - 2*u*(u^T*x) = c*e1,  so  u = (x - c*e1)/(2*u^T*x).
   The denominator is just a constant that we can choose so that ||u||_2 = 1, i.e.
      y = x +- ||x||_2*e1  and  u = y/||y||_2.
   We write this as u = House(x). (Picture of the two choices of sign.)

   How to do the QR decomposition by forming Q as a product of Householder transformations
   (picture of a 5 x 4 example):

   QR decomposition of an m x n matrix A, m >= n:
      for i = 1 to min(m-1,n)      ... the last column needs work only if m > n
         u(i) = House(A(i:m,i))    ... compute the Householder vector
         A(i:m,i:n) = (I - 2*u(i)*u(i)^T)*A(i:m,i:n)
                    = A(i:m,i:n) - u(i)*(2*u(i)^T*A(i:m,i:n))
      end

   Note: we never actually form each Householder matrix Q(i) = I - 2*u(i)*u(i)^T, or
   multiply them together to get Q; we only multiply by them. We only need to store the
   u(i), which are kept in A in place of the entries they are used to zero out (just like
   the entries of L in Gaussian elimination). (Picture.)

   The cost is sum_{i=1 to n} 4*(m-i+1)*(n-i+1) = 2*m*n^2 - (2/3)*n^3, which is
   (4/3)*n^3 when m = n, twice the cost of GE.
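A runnable sketch of the loop above (my own variable names; for clarity Q is accumulated
explicitly here, whereas a real implementation would only store the u(i)):

   m = 5; n = 4;
   A = randn(m,n);  R = A;  Q = eye(m);
   for i = 1:min(m-1,n)
      x = R(i:m,i);
      s = sign(x(1));  if s == 0, s = 1; end
      y = x;  y(1) = y(1) + s*norm(x);     % y = x +- ||x||_2*e1, sign chosen to avoid cancellation
      u = y/norm(y);                       % u = House(x), a unit vector
      R(i:m,i:n) = R(i:m,i:n) - u*(2*u'*R(i:m,i:n));   % apply I - 2*u*u^T on the left
      Q(:,i:m)   = Q(:,i:m) - (Q(:,i:m)*u)*(2*u');     % accumulate Q = Q(1)*Q(2)*...
   end
   norm(A - Q*R), norm(Q'*Q - eye(m))      % both ~ machine epsilon
   norm(tril(R,-1))                        % entries below the diagonal are now negligible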
   So when we are done we get Q(n)*Q(n-1)*...*Q(1)*A = R, or
      A = Q(1)^T*...*Q(n)^T*R = Q(1)*...*Q(n)*R = Q*R,
   and to solve the least squares problem argmin_x ||A*x-b||_2 we do x = R \ (Q^T*b), or

      for i = 1 to n
         c = -2*u(i)^T*b(i:m)        ... dot product
         b(i:m) = b(i:m) + c*u(i)    ... "saxpy"
      end
      x = R(1:n,1:n) \ b(1:n)        ... triangular solve

   which costs O(mn), much less than computing A = Q*R (again analogous to GE).
   In Matlab, x = A\b does GEPP if A is square, and solves the least squares problem using
   Householder QR when A has more rows than columns.

Optimizing the QR decomposition:
   We can (and should!) use all the ideas of Chapter 2 to minimize how much data is moved
   by the QR decomposition. We have only described the BLAS2 version (analogous to the
   simplest version of LU). We will not do the details of how to reorganize the algorithm
   to do lots of matmuls and reduce the amount of data moved to
   O(#flops / sqrt(memory_size)) (see Question 3.17), but this can be done.

   There is one other way to represent Q as a product of simple orthogonal transformations,
   called Givens rotations. They are useful in other settings than least squares problems
   (for which we use Householder), so we present them here.

   A Givens rotation
      R(theta) = [ cos(theta)  -sin(theta) ]
                 [ sin(theta)   cos(theta) ]
   rotates x counterclockwise by theta (picture).
   More generally, we take components i and j of a vector and rotate them:
   R(i,j,theta) = the identity, except for rows and columns i and j, which contain the
   entries of R(theta) (picture).

   To do QR, we want to create zero entries. So given x(i) and x(j), we pick theta so that
   R(i,j,theta)*x is zero in entry j:
      [ cos(theta)  -sin(theta) ] * [ x(i) ]  =  [ sqrt(x(i)^2+x(j)^2) ]
      [ sin(theta)   cos(theta) ]   [ x(j) ]     [ 0                   ]
   or cos(theta) = x(i)/sqrt(x(i)^2+x(j)^2), sin(theta) = -x(j)/sqrt(x(i)^2+x(j)^2).
   We do QR by repeatedly multiplying A by R(i,j,theta) matrices to introduce new zeros,
   until A becomes upper triangular (picture).

Stability of applying orthogonal matrices:
   By using the basic rule fl(a op b) = (a op b)*(1+delta), |delta| <= macheps, one can
   show the following for multiplying by one Householder or Givens transformation, where
   Q' is the floating point transformation and Q is the "exact", orthogonal transformation:
      fl(Q'*A) = Q*A + E   where   norm(E) = O(macheps)*norm(A),
   i.e. you get an exact orthogonal transformation Q*A plus a small error. We need to
   distinguish between Q' and Q because we can't compute u = y/||y|| or
   x(i)/sqrt(x(i)^2+x(j)^2) perfectly, but the errors are very small, and for the rest of
   the analysis we want an exactly orthogonal Q.
   We can also write this as fl(Q'*A) = Q*(A + Q^T*E) = Q*(A+F), where norm(F) = norm(E),
   i.e. multiplying by Q is backward stable: we get the exact orthogonal transformation of
   a slightly different matrix A+F.

   Now multiply by a sequence of orthogonal matrices, as in the QR decomposition:
      fl(Q3'*(Q2'*(Q1'*A))) = fl(Q3'*(Q2'*(Q1*A+E1)))
                            = fl(Q3'*(Q2*(Q1*A+E1)+E2))
                            = Q3*(Q2*(Q1*A+E1)+E2)+E3
                            = Q3*Q2*Q1*A + Q3*Q2*E1 + Q3*E2 + E3
                            = Q3*Q2*Q1*A + E
                            = Q*A + E = Q*(A + Q^T*E) = Q*(A+F)
   where ||F|| = ||E|| <= ||E1|| + ||E2|| + ||E3|| = O(macheps)*||A||, so multiplying by
   many orthogonal matrices is also backward stable.
   This analysis explains not just why QR is stable (and so why norm(Q^T*Q-I) = O(macheps),
   unlike Gram-Schmidt), but will also explain why our eigenvalue algorithms are stable,
   since they (mostly) just multiply by orthogonal matrices.
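A rough numerical check of this claim (my own sketch, a stand-in for the QRStability demo
mentioned earlier): on an ill-conditioned matrix, Householder QR (Matlab's qr) keeps Q
orthogonal to roughly machine precision, while modified Gram-Schmidt loses orthogonality
in proportion to cond(A):

   m = 50;  n = 20;
   [UU,RU] = qr(randn(m));  [VV,RV] = qr(randn(n));
   A = UU(:,1:n)*diag(logspace(0,-10,n))*VV';      % cond(A) ~ 1e10 by construction
   [Qh,Rh] = qr(A,0);                              % Householder QR, economy size
   Qm = A;  Rm = zeros(n);                         % modified Gram-Schmidt, as in the loop above
   for i = 1:n
      for j = 1:i-1
         Rm(j,i) = Qm(:,j)'*Qm(:,i);
         Qm(:,i) = Qm(:,i) - Rm(j,i)*Qm(:,j);
      end
      Rm(i,i) = norm(Qm(:,i));  Qm(:,i) = Qm(:,i)/Rm(i,i);
   end
   [norm(Qh'*Qh - eye(n)), norm(Qm'*Qm - eye(n))]  % ~1e-16 versus many orders of magnitude larger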
Solving least squares problems when A is (nearly) rank deficient:

   Ex: Try A = [ 1 0 ]   b = [ 1 ]   yielding x = [ 1   ]   where e is tiny.
               [ 0 e ]       [ 1 ]                [ 1/e ]
               [ 0 0 ]       [ 1 ]
   Halving (or doubling) e changes A very little in norm but doubles (or halves) the size
   of x, so x is very sensitive, as expected since cond(A) = 1/e is very large.

   Now try with e = 0, so A is rank deficient; now the answer is not unique (since if x is
   a solution and A*z = 0, then x+z is a solution). From homework Question 3.12 we know
   that ||A*x-b||_2 = ||A*x-b||_F is minimized by x = A^+*b, which also minimizes
   ||x||_2 = ||x||_F when x is not unique, i.e.
      x = [ 1 0 ]^+ * [ 1 ]  =  [ 1 0 0 ] * [ 1 ]  =  [ 1 ]
          [ 0 0 ]     [ 1 ]     [ 0 0 0 ]   [ 1 ]     [ 0 ]
          [ 0 0 ]     [ 1 ]                 [ 1 ]
   So the answer changes a lot again.

   Now suppose that A has some uncertainty (it almost always does), so that its entries
   are only known up to some tolerance +-tol > e, so that both matrices A above could
   equally well be the "true" A. How do we get a meaningful answer? We need to
   "regularize" the solution somehow, i.e. impose some extra conditions to get an answer
   that depends more smoothly on the data. One way to do this is to use the "truncated SVD":

   Def: If A = U*Sigma*V^T is the SVD, and tol > 0 is a measure of the uncertainty in A,
   the truncated SVD of A is A' = U*Sigma'*V^T where Sigma' is the same as Sigma except
   that the singular values sigma(i) < tol are set to zero. If sigma(k) is the smallest
   singular value kept (so sigma(k) >= tol), then the regularized solution x = (A')^+*b
   depends smoothly on the data: if b changes from b1 to b2, the solution changes by
      || (A')^+*(b1-b2) ||_2 <= (1/sigma(k)) * || U^T*(b1-b2) ||_2 <= (1/tol) * || b1-b2 ||_2.

   As before, the SVD makes everything easy, but it is expensive, several times as much as
   QR (depending on your computer), so an alternative to the truncated SVD is the QR
   decomposition with column pivoting:

   Suppose we did A = QR exactly, with A of rank r < n; what would R look like?
   If the leading r columns of A were full rank (true for "most" such A), then
      R = [ R11 R12 ]   with R11 r x r and full rank, so R22 = 0.
          [ 0   R22 ]
   If A is nearly low rank, we can hope for ||R22|| < tol, and set it to zero.
   Assuming for a moment that this works, write
      A = Q*R = [Q,Q'] * [ R ]   with [Q,Q'] = [Q1,Q2,Q'] square and orthogonal
                         [ 0 ]
   as before, with Q1 m x r, Q2 m x (n-r) and Q' m x (m-n). Thus
      argmin_x || A*x-b ||_2 = argmin_x || [Q1,Q2,Q'] * [ R ] * x - b ||_2
                                                        [ 0 ]
         = argmin_x || [ R11 R12 ] * [ x1 ] - [ Q1^T*b ] ||_2
                       [ 0   0   ]   [ x2 ]   [ Q2^T*b ]
                       [ 0   0   ]            [ Q'^T*b ]
         = argmin_x || [ R11*x1 + R12*x2 - Q1^T*b ] ||_2
                       [          - Q2^T*b        ]
                       [          - Q'^T*b        ]
   with solution x1 = inv(R11)*(Q1^T*b - R12*x2) for any x2.
   But how do we pick x2 to minimize ||x||_2? And is the solution as smooth as desired,
   i.e. could cond(R11) be >> 1/tol?

   Ex: A = .5*eye(n) - diag(ones(n-1,1),1) is already upper triangular, so A = Q*R = R.
   But cond(A) ~ 2^n (since (inv(A))(i,j) = 2^(j-i+1) on and above the diagonal), so at
   least one singular value is ~ 2^(-n) and should be set to zero when n is large. But if
   we reorder the columns of A as Ap = A(:,[2:n,1]) and then do QR, we get Ap = Q*R with
   R22 = R(n,n) ~ 2^(-n), the other R(i,i) = O(1), R11 = R(1:n-1,1:n-1) with
   cond(R11) = O(1), and R12 = R(1:n-1,n) with norm(R12) = O(1).
   So choosing x2 = 0 and x1 = inv(R11)*Q1^T*b gives a solution with small norm, as
   desired. (Matlab demo; see the sketch below.)

   How to choose the permutation in general? The algorithm is called column pivoting: at
   each step, look at the remaining columns not yet reduced to upper triangular form, and
   pick the one whose remaining part has the biggest norm.
   Available in Matlab as [Q,R,P] = qr(A) (permutation returned in P), which uses a LAPACK
   routine (dgeqp3).
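A possible version of the demo alluded to above (my own sketch): compare plain QR, QR
after the hand-chosen column permutation [2:n,1], and Matlab's built-in QR with column
pivoting on the example matrix:

   n = 30;
   A = 0.5*eye(n) - diag(ones(n-1,1),1);     % upper triangular, cond(A) ~ 2^n
   min(svd(A))                               % smallest singular value, ~ 2^(-n)
   [Q0,R0] = qr(A);            abs(R0(n,n))  % no pivoting: |R(n,n)| = 0.5, rank not revealed
   [Q1,R1] = qr(A(:,[2:n,1])); abs(R1(n,n))  % hand permutation: |R(n,n)| is exponentially small
   [Q2,R2,P] = qr(A);          abs(R2(n,n))  % column pivoting picks its own permutation; A*P = Q2*R2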