Notes for Ma221 Lecture 12, Oct 6 2009

Goals for today: Section 3.1-3.3

Example: polynomial fitting 
         formulate as minimizing norm of A*x-b, A(i,j) = y(i)^(j-1),
            where (y(i),b(i)) are sample points
         run polyfit31 to show results
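         A minimal Matlab sketch of this setup (polyfit31 itself is not reproduced;
         the degree, sample points, and noise below are made up for illustration):
            m = 50; n = 4;                      % 50 samples, fit a cubic (degree n-1 = 3)
            y = linspace(-1,1,m)';              % sample abscissas y(i)
            b = cos(2*pi*y) + 0.1*randn(m,1);   % sample values b(i), with made-up noise
            A = zeros(m,n);
            for j = 1:n, A(:,j) = y.^(j-1); end % A(i,j) = y(i)^(j-1)
            x = A \ b;                          % backslash solves the least squares problem
            plot(y, b, 'o', y, A*x, '-')        % data vs. fitted polynomial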
Standard notation argmin_x norm(A*x-b), A mxn, m>n
Other variants (code in LAPACK):
      constrained: argmin_x norm(Ax-b) st Bx=y
              #rows(B) <= #cols(A) <= #rows(A) + #rows(B)
              so the answer is unique
      weighted: argmin_x norm(y) st d = Ax+By
             weighted because if B = I , it would be usual problem
             if B square: 
               argmin_x norm(inv(B)*(d-Ax))
      underdetermined: If #rows(A) < #cols(A), then there is a whole set of solutions
         (add any z st Az=0 to x), so to make unique seek
           argmin_x norm(x) st Ax=b

Algorithms:
    Normal equations - A^T*A*x = A^T*b  
       A^T*A is spd, so solve with Cholesky
       fastest (in dense case), but not stable if A ill-conditioned
    QR decomposition A = Q*R
       Gram-Schmidt - unstable
       Householder - stable
       blocked Householder - to minimize data movement
    SVD - most "complete" answer, shows exactly how small changes in A, b
          could perturb solution x, best when A ill-conditioned
    conversion to linear system - allows iterative refinement (Q 3.3)

Normal equations
    vanishing gradient of norm(A*x-b,2)^2
    why a minimizer? complete the square in norm(A*(x+e)-b,2)^2
    picture, use of Pythagorean Thm
    #flops: m*n^2 (A'*A) + n^3/3 (Cholesky)
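    A minimal sketch of this solve in Matlab (sizes and data made up):
       m = 100; n = 5;
       A = randn(m,n);  b = randn(m,1);
       R = chol(A'*A);            % A'*A = R'*R, R upper triangular (spd, so Cholesky works)
       x = R \ (R' \ (A'*b));     % two triangular solves
       norm(x - A\b)              % tiny here, but grows like cond(A)^2 when A is ill-conditioned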

Gram-Schmidt A = Q*R  (dimensions)
  Assuming we can compute A = Q*R, how do we solve argmin_x norm(A*x-b)?
    x = R\(Q'*b)
  Why?
    proof 1: A = Q*R = [Q,Qhat]*[R;0] where [Q,Qhat] square, orthogonal ...
    proof 2: Ax - b = Q*R*x -b = Q*R*x - Q*Q'*b + Q*Q'*b - b 
                    = Q(R*x-Q'*b) + (Q*Q'-I)*b which are orthogonal ...
    proof 3: plug into normal equations
  Algorithms:
    Classical and Modified Gram-Schmidt
      for i=1:n
        Q(:,i) = A(:,i)
        for j=1:i-1
           r(j,i) = Q(:,j)'*A(:,i) ... CGS, costs 2m
           r(j,i) = Q(:,j)'*Q(:,i) ... MGS, costs 2m
           Q(:,i) = Q(:,i) - r(j,i)*Q(:,j) ... costs 2m
        end
        r(i,i) = norm(Q(:,i),2) ... costs 2m
        Q(:,i) = Q(:,i)/r(i,i)  ... costs m
      end
    total #flops = 4m*n^2/2 +O(mn) = 2mn^2 + O(mn), about 2x normal equations
    plots showing instability: 
          run QRStability with cnd = 1e1, 1e2, 1e4, 1e8, size 6x4, samples = 100
          (a standalone sketch of the same experiment appears just below)
    Eventually: will explain stable implementation;
        Also needed for eigenvalue problems
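    A standalone sketch of the instability experiment (QRStability itself is not
    reproduced; the sizes and condition number follow the demo above):
       m = 6; n = 4; cnd = 1e8;
       U = orth(randn(m,n));  V = orth(randn(n));
       A = U*diag(logspace(0,-log10(cnd),n))*V';    % random A with cond(A) = cnd
       for variant = 1:2                            % 1 = CGS, 2 = MGS
         Q = zeros(m,n); r = zeros(n,n);
         for i = 1:n
           Q(:,i) = A(:,i);
           for j = 1:i-1
             if variant == 1, r(j,i) = Q(:,j)'*A(:,i);   % CGS
             else             r(j,i) = Q(:,j)'*Q(:,i);   % MGS
             end
             Q(:,i) = Q(:,i) - r(j,i)*Q(:,j);
           end
           r(i,i) = norm(Q(:,i));  Q(:,i) = Q(:,i)/r(i,i);
         end
         fprintf('variant %d: norm(Q''*Q - I) = %g\n', variant, norm(Q'*Q - eye(n)));
       end
    Expect MGS to lose orthogonality roughly like macheps*cond(A), while CGS can lose
    it much faster.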

SVD = Singular Value Decomposition
     (History: first complete version goes back to Eckart&Young in 1936,
      First algorithm by Golub & Kahan in 1965, faster ones since then,
      to be discussed in Chapter 5)

   Thm. Suppose A = m x n with m >= n. Then there are
           U = [u(1),...,u(n)] = m x n with orthonormal columns
           Sigma = diag(sigma(1),...,sigma(n)) = n x n diagonal with 
                  sigma(1) >= sigma(2) >= ... >= 0
           V = [v(1),...,v(n)] = n x n orthogonal
        with A = U*Sigma*V^T. 
           u(i) called left singular vectors
           sigma(i) called singular values
           v(i) called right singular vectors

   Geometric interpretation: Thinking of A as a linear mapping from R^n
     to R^m, with the right orthogonal choice of bases of R^n (i.e. V) and 
     R^m (i.e. U) A is diagonal (i.e. Sigma)

   Proof: Induction: 
          assuming A nonzero (else easy)
             ||A||_2 = max_x ||A*x||_2 / ||x||_2
                      = max_{x: ||x||_2 = 1} ||A*x||_2
              Let v(1) be x attaining the max, sigma(1) = ||A||_2,
              and u(1) = A*v(1) / ||A*v(1)||_2 .
          Choose V = [v(1),Vhat] to be square & orthogonal
          Choose U = [u(1),Uhat] to be square & orthogonal
          Write Ahat = U^T*A*V
                 = [ u(1)^T*A*v(1)  u(1)^T*A*Vhat ]
                   [ Uhat'*A*v(1)   Uhat'*A*Vhat  ]
                 = [ sigma(1)       A12 ]
                   [ A21            A22 ]
          Show A21 = 0 by def of Uhat
          Show A12 = 0 by def of sigma(1) = ||A||_2
          Apply induction to A22 = U_2*Sigma_2*V_2^T

   Thm: The SVD has many useful properties (assume A m x n with m >= n; a quick
        numerical check of several of the facts below follows the list)
      Fact 1: If A is symmetric with eigenvalues lambda(i) and orthonormal eigenvectors v(i),
              then its SVD is A = V*Lambda*V^T (done if all lambda_i >= 0)
                                = (V*D)*(D*Lambda)*V^T where D = diag(sign(lambda(i)))
                                = U*Sigma*V^T
      Fact 2: The eigenvalue decomposition of A^T*A = (U*Sigma*V^T)^T*(U*Sigma*V^T) = V*Sigma^2*V^T
              (what happens if m>n?)
      Fact 3: The eigenvalue decomposition of A*A^T = (U*Sigma*V^T)*(U*Sigma*V^T)^T = U*Sigma^2*U^T
              (what happens if m>n?)
      Fact 4: Let H = [ 0 A^T ] be (m+n) x (m+n), where A is m x n as above
                      [ A  0  ]
              Then H has eigenvalues +- sigma(i), with eigenvectors (1/sqrt(2))*[v(i) ; +- u(i)],
              plus m-n zero eigenvalues when m > n
              (plug in A = U*Sigma*V^T)
      Fact 5: If A has full rank, then argmin_x norm(Ax-b,2) = V*inv(Sigma)*U^T*b
              (plug in A = U*Sigma*V^T)
      Fact 6: ||A||_2 = sigma(1); if A is square and nonsingular, ||inv(A)||_2 = 1/sigma(n) and
              ||A||_2 * ||inv(A)||_2 = sigma(1)/sigma(n)
      Fact 7: Suppose sigma(1) >= ... >= sigma(r) > 0 = sigma(r+1) = ... = sigma(n). Then
              A has rank r; the null-space of A is span(v(r+1),...,v(n)), of dimension n-r, 
              and the range space of A is span(u(1),...,u(r)), of dimension r
      Fact 8: Let S be the unit sphere in R^n. Then A*S is an ellipsoid centered at the
              origin with principal axes sigma(i)*u(i)
              (write A*S = U*Sigma*V^T*S = U*Sigma*S = sum_i u(i)*sigma(i)*w(i) where sum_i w(i)^2 = 1)
              (matlab demo svddemo2, svddemo3)
      Fact 9: Matrix A_k of rank k closest to A in 2-norm is 
              A_k = sum_{i=1 to k} u_i*sigma(i)*v(i)^T = U*Sigma_k*V^T
              where Sigma_k = diag (sigma(1) , ... , sigma(k), 0, ... 0)
              and the distance is norm(A - A_k , 2) = sigma(k+1)
               (easy to see that A_k has the right rank and the right distance to A; need to show there is no closer one:
                Suppose B has rank k, so its null space has dimension n-k. The space spanned by
                {v(1),...,v(k+1)} has dimension k+1. Since the sum of the dimensions (n-k)+(k+1) > n,
                these two spaces must intersect (we can't have > n independent vectors in R^n).
                Let h be a unit vector in their intersection. Then
                norm(A-B,2) >= norm((A-B)*h,2) = norm(A*h,2)
                             = norm(U*Sigma*V^T*h,2) = norm(Sigma*V^T*h,2)
                             = norm(Sigma*[x(1),...,x(k+1),0,...,0]^T)  ... writing V^T*h = [x(1),...,x(k+1),0,...,0]^T since h in span{v(1),...,v(k+1)}
                            >= sigma(k+1), since norm(V^T*h,2) = norm(h,2) = 1
              (matlab demo: load clown.mat, [U,S,V]=svd(X); colormap('gray')
               figure(1), clf, image(X),
               figure(2), clf, for k=[1:10,20:10:200], Xk=U(:,1:k)*S(1:k,1:k)*(V(:,1:k))'; ...
                          err = norm(X-Xk)/norm(X); image(Xk), colormap('gray'), ...
                          title(['k= ',int2str(k),' err= ', num2str(err)]), pause, end
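      A quick numerical check of several of these facts on a made-up random A:
         m = 7; n = 4;  A = randn(m,n);
         [U,S,V] = svd(A,0);  sig = diag(S);           % "economy" SVD: U is m x n
         norm(sort(eig(A'*A)) - sort(sig.^2))          % Fact 2: eigenvalues of A'*A are sigma(i)^2
         H = [zeros(n) A'; A zeros(m)];                % Fact 4: eigenvalues are +-sigma(i), plus m-n zeros
         sort(abs(eig(H)))'
         norm(A) - sig(1)                              % Fact 6: 2-norm = largest singular value
         k = 2;  Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';   % Fact 9: best rank-k approximation
         norm(A - Ak) - sig(k+1)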

   Def: If A = U*Sigma*V^T has full rank, its (Moore-Penrose) pseudoinverse is
           A^+ = V*inv(Sigma)*U^T

   Facts: A^+ = inv(A^T*A)*A^T
          Solution to argmin_x norm(A*x-b) = A^+*b
          In matlab, A^+ = pinv(A)  (when A not full rank or #cols>#rows, def later)
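   A small sketch comparing these formulas on made-up full-rank data (all three
   agree to roundoff):
      m = 20; n = 5;  A = randn(m,n);  b = randn(m,1);
      [U,S,V] = svd(A,0);
      x_svd  = V*(S\(U'*b));            % x = V*inv(Sigma)*U^T*b
      x_pinv = pinv(A)*b;               % Moore-Penrose pseudoinverse
      x_qr   = A\b;                     % Householder QR via backslash
      [norm(x_svd - x_qr), norm(x_pinv - x_qr)]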

   Perturbation theory for least squares problem:
   If x = argmin_x norm(A*x-b,2), and we change A and b a little, how much can x change?
    Since the square case is a special case, we expect cond(A) to show up, but there is another
    source of sensitivity:
       A = [1;0], b = [0;b2]  => x = inv(A^T*A)*A^T*b = 1*0 = 0
       A = [1;0], b = [b1;b2] => x = 1*b1 = b1
        If b2 is very large, b1 could be small compared to b2 but still large, so a big change in x
   More generally, if b (nearly) orthogonal to span(A), then condition number very large

   Expand e = (x+e) - x = inv((A+dA)^T*(A+dA))*(A+dA)^T*(b+db) - inv(A^T*A)*A^T*b,
   keep only the linear terms (one factor of dA and/or db), and call
     eps = max(norm(dA)/norm(A) , norm(db)/norm(b)) to get
    norm(e)/norm(x) <= eps*(2*cond(A)/cos(theta) + tan(theta)*cond(A)^2) + O(eps^2)
     where theta = angle(b,A*x), or sin(theta) = norm(Ax-b,2)/norm(b,2)
    So condition number can be large if
      (1) cond(A) large
      (2) theta near pi/2, i.e. 1/cos(theta) ~ tan(theta) large
      (3) error like cond(A)^2 when theta not near zero
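    A sketch checking this bound on made-up data with a prescribed cond(A)
    (eps is called ep below to avoid Matlab's built-in eps):
       m = 30; n = 5;  cnd = 1e4;  ep = 1e-10;
       A = orth(randn(m,n))*diag(logspace(0,-log10(cnd),n))*orth(randn(n))';  % cond(A) = cnd
       b = randn(m,1);
       x = A\b;
       theta = asin(norm(A*x-b)/norm(b));               % sin(theta) = norm(A*x-b)/norm(b)
       dA = randn(m,n);  dA = ep*norm(A)*dA/norm(dA);   % norm(dA)/norm(A) = ep
       db = randn(m,1);  db = ep*norm(b)*db/norm(db);   % norm(db)/norm(b) = ep
       xp = (A+dA)\(b+db);
       observed = norm(xp-x)/norm(x)                    % should be at most bound, to first order
       bound = ep*(2*cond(A)/cos(theta) + tan(theta)*cond(A)^2)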

   Will turn out that the right QR decomposition keeps eps = O(machine epsilon),
     so backward stable

   Normal equations are not backward stable; since we do Chol(A^T*A) the
   error bound is always proportional to cond(A)^2, even when theta small.

Stable algorithms for QR decomposition:
   Recall that MGS and CGS produced Qs that were far from orthogonal, even though
   norm(A-Q*R) was as small as desired.  For solving least squares problems
   (and later eigenproblems and computing the SVD) we need algorithms that keep
   Q very nearly orthogonal, i.e. norm(Q^T*Q-I) = O(macheps)
   Note: MGS still has its uses in some other algorithms, CGS does not

   Basic idea: express Q as a product of simple orthogonal matrices Q = Q(1)*Q(2)*...
   where each Q(i) accomplishes part of the task (multiplying A by it makes some
   entries zero, until A turns into R).  Since a product of orthogonal matrices
   is orthogonal, there is no danger of losing orthogonality.

   There are two kinds of simple orthogonal matrices: Householder transformations
   and Givens rotations.

   A Householder transformation (or reflection) is of the form H = I-2*u*u^T, 
   where u is a unit vector. We confirm orthogonality as follows:
    H*H^T = (I-2*u*u^T)*(I-2*u*u^T) = I - 4*u*u^T + 4*u*(u^T*u)*u^T = I, since u^T*u = 1
   It is called a reflection because H*x is a reflection of x in the plane orthogonal to u
   (picture).
   Given x, we want to find u so that H*x has zeros in particular locations; specifically,
   we want u so that H*x = [c,0,0,...,0]^T = c*e1 for some constant c. Since the 2-norm of both
   sides is ||x||_2, then |c| = ||x||_2. We solve for u as follows: Write
      H*x = (I-2*u*u^T)*x = x - 2*u*(u^T*x) = c*e1, or u = (x-c*e1)/(2*u^T*x).
   The denominator is just a constant that we can choose so that ||u||_2 = 1,
   i.e. y = x +- ||x||_2*e1 and u = y/||y||_2. We write this as u = House(x).
   (picture of two choices).
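   A Matlab sketch of u = House(x) (the file name house.m below is made up);
   the sign of c is chosen so that forming y = x - c*e1 involves no cancellation:
      function u = house(x)
      % u = house(x): unit vector u with (I - 2*u*u')*x = c*e1, |c| = norm(x).
      % Assumes x is a nonzero column vector.
        if x(1) >= 0, c = -norm(x); else c = norm(x); end   % sign chosen to avoid cancellation
        y = x;  y(1) = y(1) - c;                            % y = x - c*e1 = x +- norm(x)*e1
        u = y / norm(y);
      end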

   How to do QR decomposition by forming Q as a product of Householder transformations
   (picture of 5 x 4 example)

   QR decomposition of m x n matrix A, m >= n
      for i = 1 to min(m-1,n) ... only need to do last column if m > n
         u(i) = House(A(i:m,i))   ... compute Householder vector
         A(i:m,i:n) = (I - 2*u(i)*u(i)^T)*A(i:m,i:n)
                    = A(i:m,i:n) - u(i)*(2*u(i)^T*A(i:m,i:n))
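   A Matlab sketch of this loop (house_qr.m is a made-up name; it calls the house
   sketch above).  For clarity the u(i) go into a separate array UU, though as noted
   below they could overwrite the entries of A they zero out:
      function [UU, R] = house_qr(A)
      % Householder QR of A (m x n, m >= n): returns the Householder vectors u(i)
      % in the columns of UU and the upper triangular factor R.
        [m, n] = size(A);
        UU = zeros(m, n);
        for i = 1:min(m-1, n)
          u = house(A(i:m, i));                                % zero out A(i+1:m, i)
          A(i:m, i:n) = A(i:m, i:n) - u*(2*(u'*A(i:m, i:n)));  % apply I - 2*u*u'
          UU(i:m, i) = u;
        end
        R = triu(A(1:n, 1:n));
      end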

   Note: we never actually form each Householder matrix Q(i) = I-2*u(i)*u(i)^T,
   or multiply them to get Q, we only multiply by them.
   We only need to store the u(i), which are kept in A in place of the
   data they are used to zero out (just like the entries of L in Gaussian elimination).
   (picture)

   The cost is sum_{i=1 to n} 4*(m-i+1)*(n-i+1) = 2n^2m - (2/3)n^3 = (4/3)n^3 when m=n,
   twice the cost of GE

   So when we are done we get
     Q(n)*Q(n-1)*...*Q(1)*A = R or
     A = Q(1)^T * ... * Q(n)^T * R = Q(1)*...*Q(n)*R = Q*R
   and to solve the least squares problem argmin_x ||Ax-b||_2 we do
      x = R \ ( Q^T * b )
   or
     for i=1 to n
       c = -2*u(i)^T*b(i:m)       ... dot product
       b(i:m) = b(i:m) + c * u(i) ... "saxpy"
     endfor
     b = R \ b
   which costs O(mn), much less than A = QR (again analogous to GE)
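   A sketch of this solve phase (house_lsq.m is a made-up name), using the output
   of the house_qr sketch above; x should agree with A\b to roundoff:
      function x = house_lsq(UU, R, b)
      % Solve argmin_x norm(A*x - b) given the Householder vectors UU and R from house_qr.
        [m, n] = size(UU);
        for i = 1:min(m-1, n)
          u = UU(i:m, i);
          b(i:m) = b(i:m) - 2*u*(u'*b(i:m));   % apply Q(i)^T = Q(i) to b, O(m-i) flops
        end
        x = R \ b(1:n);                        % back substitution with the top n entries of Q^T*b
      end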

   In Matlab, x = A\b does GEPP if A is square, and solves the least 
   squares problems using Householder QR when A has more rows than columns.

   Optimizing QR decomposition: We can (and should!) use all the ideas of Chapter 2
   to minimize how much data is moved by QR decomposition. We have only described
   the BLAS2 version (analogous to simplest version of LU). We will not do the
   details of how to reorganize the algorithm to do lots of matmuls, and reduce
   the amount of data moved to O(#flops / sqrt(memory_size)) (see Question 3.17),
   but this can be done.
      
   There is one other way to represent Q as a product of simple orthogonal
   transformations, called Givens rotations. They are useful in other
   cases than least squares problems (for which we use Householder) so we
   present them here.

   A Givens rotation R(theta) = [ cos(theta) -sin(theta) ]
                                [ sin(theta)  cos(theta) ]
   rotates x counterclockwise by theta (picture)
   More generally, we will take components i and j of a vector and rotate them:
   R(i,j,theta) = identity except for rows and columns i and j, containing entries of R(theta)
   (picture)
   To do QR, we want to create zero entries. So given x(i) and x(j), we want to pick theta
   so that R(i,j,theta)*x is 0 in entry j:
        [ cos(theta)  -sin(theta) ] * [ x(i) ]  =  [ sqrt(x(i)^2+x(j)^2) ]
        [ sin(theta)   cos(theta) ]   [ x(j) ]     [          0          ]
   or
       cos(theta) = x(i)/sqrt(x(i)^2+x(j)^2),  sin(theta) = -x(j)/sqrt(x(i)^2+x(j)^2)
   We do QR by repeatedly multiplying A by R(i,j,theta) matrices to introduce new zeros,
   until A becomes upper triangular (picture).
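   A small sketch of one rotation (overflow in sqrt(x(i)^2+x(j)^2) is ignored here;
   LAPACK's dlartg is more careful):
      x = [3; 4];                     % made-up values playing the role of [x(i); x(j)]
      r = sqrt(x(1)^2 + x(2)^2);
      c = x(1)/r;  s = -x(2)/r;       % cos(theta) and sin(theta) as above
      G = [c -s; s c];
      G*x                             % = [r; 0] up to roundoff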

   Stability of applying Orthogonal matrices
     By using the basic rule fl(a op b) = (a op b)*(1+delta) , |delta| <= macheps
     one can show that for either multiplying by one Householder or Givens transformation Q
     one gets the following, where Q' is the floating point transformation,
     and Q is the "exact", orthogonal transformation:
        fl( Q'*A ) =  Q*A + E where norm(E) = O(macheps)*norm(A)
     i.e. you get an exact orthogonal transformation Q*A plus a small error.
     We need to distinguish between Q' and Q because we can't compute u = y/||y||
     or x(i)/sqrt(x(i)^2+x(j)^2) perfectly, but the errors are very small, and 
     for the rest of the analysis we want an exactly orthogonal Q.

     We can also write this as fl(Q'*A) = Q(A + Q^T*E) = Q*(A+F), where norm(E)=norm(F),
     i.e. multiplying by Q is backward stable: we get the exact orthogonal transformation
     of a slightly different matrix A+F.

     Now multiply by a sequence of orthogonal matrices, as in QR decomposition:
        fl(Q3'*(Q2'*(Q1'*A))) = fl(Q3'*(Q2'*(Q1*A+E1)))
                              = fl(Q3'*(Q2*(Q1*A+E1)+E2))
                              = Q3*(Q2*(Q1*A+E1)+E2)+E3
                              = Q3*Q2*Q1*A + Q3*Q2*E1 + Q3*E2 + E3
                              = Q3*Q2*Q1*A + E
                              = Q*A + E
                              = Q*(A + Q^T*E)
                              = Q*(A+F)
         where ||F|| = ||E|| <= ||E1|| + ||E2|| + ||E3|| = O(macheps)*||A||
     so multiplying by many orthogonal matrices is also backwards stable.

     This analysis explains not just why QR is stable (and so why norm(Q^T*Q-I) = O(macheps),
      unlike Gram-Schmidt), but will also explain why our eigenvalue algorithms are stable,
     since they (mostly) just multiply by orthogonal matrices.

  Solving Least Squares Problems when A is (nearly) rank-deficient:
    Ex: Try  A = [ 1 0 ],   b = [ 1 ],  yielding x = [  1  ]  where e tiny
                 [ 0 e ]        [ 1 ]                [ 1/e ]
                 [ 0 0 ]        [ 1 ] 
        doubling (or halving) e changes A very little in norm
        but doubles (or halves) size of x, so x very sensitive,
        as expected since cond(A) = 1/e is very large

        Now try with e = 0, so A rank deficient; now answer not unique
        (since if x is solution and A*z=0, then x+z is a solution).
        From homework Question 3.12 we know that ||A*x-b||_2 = ||A*x-b||_F minimized 
         by x = A^+*b, which also minimizes ||x||_2 = ||x||_F when
        x not unique, i.e. x = [ 1 0 ]^+ * [ 1 ] = [ 1 0 0 ] * [ 1 ] = [ 1 ]
                               [ 0 0 ]     [ 1 ]   [ 0 0 0 ]   [ 1 ]   [ 0 ]
                               [ 0 0 ]     [ 1 ]               [ 1 ] 
        So the answer changes a lot again.

        Now suppose that A has some uncertainty (it almost always does),
        so that its entries are only known up to some tolerance +-tol > e,
        so that both matrices A above could equally well be the "true" A.
        How do we get a meaningful answer?

        Need to "regularize" the solution somehow, i.e. impose some extra
        conditions to get an answer that depends more smoothly on the data.
        One way to do this is to use the "truncated SVD":

         Def: If A = U*Sigma*V^T is the SVD, and tol > 0 is a measure of the
         uncertainty in A, the truncated SVD of A is A' = U*Sigma'*V^T,
         where Sigma' is the same as Sigma except that the singular values
         sigma(i) < tol are set to zero. Say sigma(k) >= tol > sigma(k+1), so A' has rank k.
         Solving the least squares problem with A' in place of A, i.e. taking
         x = (A')^+ * b, now depends smoothly on the data: if x1 = (A')^+*b1 and
         x2 = (A')^+*b2, then
         || x1 - x2 ||_2 = || V*(Sigma')^+*U^T*(b1-b2) ||_2
         <= (1/sigma(k)) * || U^T*(b1-b2) ||_2
         <= (1/tol) * || b1-b2 ||_2
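         A sketch on the 3 x 2 example above (e and tol are made-up values with e < tol):
            e = 1e-10;  tol = 1e-6;
            A = [1 0; 0 e; 0 0];  b = [1; 1; 1];
            [U,S,V] = svd(A,0);
            k = sum(diag(S) >= tol);                       % number of singular values kept
            x_trunc = V(:,1:k)*(S(1:k,1:k)\(U(:,1:k)'*b))  % = [1; 0], insensitive to changes in e
            x_full  = A\b                                  % = [1; 1/e], wildly sensitive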

        As before, the SVD makes everything easy, but it is expensive, several times
        as much as QR (depends on your computer), so an alternative to truncated SVD
        is the QR decomposition with column pivoting: 
         Suppose we did A = QR exactly, with A of rank r < n; what would R look like?
         If the leading r columns of A were full rank (true for "most" such A), then
           R = [ R11 R12 ] with R11 r x r and full rank, so R22 = 0.
               [  0  R22 ]
         If A nearly low-rank, we can hope for  || R22 || < tol, and set it to zero.
         Assuming that this works for a moment,
         write A = QR = [Q,Q']*[ R ] with [Q,Q'] = [Q1,Q2,Q'] square and orthogonal 
                               [ 0 ]
         as before, with Q1 m x r, Q2 m x (n-r) and Q' m x (m-n)
         Thus argmin_x || Ax-b ||_2 = argmin_x || [Q1,Q2,Q']*[ R ] * x - b ||_2
                                                             [ 0 ]
                                    = argmin_x || [ R11 R12 ] * [ x1 ] - [ Q1^T*b ] ||_2
                                                  [  0   0  ]   [ x2 ]   [ Q2^T*b ]
                                                  [  0   0  ]            [ Q'^T*b ] 
               = argmin_x  || [ R11*x1 + R12*x2 - Q1^T*b ] ||_2
                              [                 - Q2^T*b ] 
                              [                 - Q'^T*b ] 
        with solution x1 = inv(R11)*(Q1^T*b - R12*x2) for any x2
        But how do we pick x2 to minimize || x ||_2? 
        and is the solution as smooth as desired, i.e. could cond(R11) be >> 1/tol?

        Ex: A = .5*eye(n) - diag(ones(n-1,1),1) is already upper triangular,
            so A = QR = R. But cond(A) ~ 2^n (since (inv(A))(i,j) = 2^(j-i+1) on and above the diagonal),
            so at least one singular value is ~ 2^(-n), and should be set to zero when n is large.
            But if we order the columns of A to be Ap = A(:,[2:n,1]) and then do QR, we
            get Ap = QR with 
               R22 = R(n,n) ~ 2^(-n),
               the other R(i,i) = O(1), and R11 = R(1:n-1,1:n-1) with cond(R11) = O(1), 
               R12 = R(1:n-1,n) with norm(R12) = O(1)
            So choosing x2 = 0 and x1 = inv(R11)*Q1^T*b is a solution with small norm,
              as desired.
            (matlab demo)
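            A sketch of the demo (n is made up; 2^(-30) is about 1e-9, well above roundoff):
               n = 30;
               A = 0.5*eye(n) - diag(ones(n-1,1), 1);   % upper bidiagonal, cond(A) ~ 2^n
               [Q, R]   = qr(A);                        % A already triangular: |R(i,i)| = 0.5 for all i
               [Qp, Rp] = qr(A(:, [2:n, 1]));           % reordered columns: |Rp(n,n)| ~ 2^(-n)
               [Q2, R2, P] = qr(A);                     % built-in column pivoting, A*P = Q2*R2
               [abs(R(n,n)), abs(Rp(n,n)), abs(R2(n,n)), min(svd(A)), 2^(-n)]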

        How to choose permutation in general?
        Algorithm called column pivoting: at each step, look at the remaining columns
          not yet reduced to upper triangular form, and pick the one with the biggest norm
        Available in Matlab as [Q,R,P]=qr(A) (permutation returned in P, with A*P = Q*R),
        which uses the LAPACK QR-with-column-pivoting routine (dgeqp3)