Notes for Ma221 Lecture 13, Oct 7 2010

Goals for today: Section 3.1-3.3 (recall SVD)

Example: polynomial fitting 
         formulate as minimizing norm of A*x-b, A(i,j) = y(i)^(j-1),
            where (y(i),b(i)) are sample points
         run polyfit31 to show results
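         A minimal Matlab sketch of this setup (not the polyfit31 script itself;
         the degree and data below are made up for illustration):
            m = 20; d = 4;                      % m sample points, fit d coefficients
            y = linspace(0,1,m)';               % sample locations y(i)
            b = cos(4*y) + 1e-2*randn(m,1);     % noisy samples b(i)
            A = bsxfun(@power, y, 0:d-1);       % A(i,j) = y(i)^(j-1)  (Vandermonde-type)
            x = A \ b;                          % least squares fit, x = argmin ||A*x-b||
            plot(y, b, 'o', y, A*x, '-')        % data vs fitted polynomial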
Standard notation argmin_x norm(A*x-b), A mxn, m>n
Other variants (code in LAPACK):
      constrained: argmin_x norm(Ax-b) st Bx=y
              #rows(B) <= #cols(A) <= #rows(A) + #rows(B)
              so answer unique if matrices involved are also full rank
      weighted: argmin_x norm(y) st d = Ax+By
             weighted because if B = I , it would be usual problem
             if B square: 
               argmin_x norm(inv(B)*(d-Ax))
      underdetermined: If #rows(A) < #cols(A), then whole set of solutions
         (add any z st Az=0 to x), so to make unique seek
           argmin_x norm(x) st Ax=b

Algorithms (all are used in various situations)
    Normal equations - A^T*A*x = A^T*b  
       A^T*A is spd, so solve with Cholesky
       fastest in dense case (fewest flops, least communication)
       but not stable if A ill-conditioned
    QR decomposition A = Q*R
       Gram-Schmidt - unstable if A is ill-conditioned
       Householder - stable
       blocked Householder - also minimizes data movement
       (also possible to use normal equations approach to get 
        Q explicitly, called CholeskyQR, same pros and cons as
        normal equations: fast but can be unstable)
    SVD - most "complete" answer, shows exactly how small changes in A, b
          could perturb solution x, best when A ill-conditioned
    conversion to linear system - allows iterative refinement (Q 3.3)

Normal equations
     set the gradient of norm(A*x-b,2)^2 = (A*x-b)^T*(A*x-b) to zero:
        expanding norm(A*(x+e)-b,2)^2 = ||Ax-b||^2 + 2*e^T*A^T*(Ax-b) + ||Ae||^2,
        the term linear in e must vanish for every e, i.e. A^T*A*x = A^T*b
     why minimizer?: 
         (1) note that the Hessian 2*A^T*A is positive definite (A full column rank)
         (2) complete the square: when A^T*(Ax-b) = 0, then for any e
              norm(A*(x+e)-b,2)^2 = (Ax-b + Ae)^T*(Ax-b + Ae)
                                  = (Ax-b)^T*(Ax-b) + 2*e^T*A^T*(Ax-b) + (Ae)^T*(Ae)
                                  = ||Ax-b||^2 + ||Ae||^2 >= ||Ax-b||^2
    picture, use of Pythagorean Thm
    #flops: m*n^2 (A'*A) + n^3/3 (Cholesky)
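    A short Matlab sketch of this approach (assuming A has full column rank, so A'*A is spd):
       R = chol(A'*A);           % A'*A = R'*R with R upper triangular
       x = R \ (R' \ (A'*b));    % two triangular solves give the LS solution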

Gram-Schmidt A = Q*R  (dimensions)
  Assuming we can compute A = Q*R, how do we solve argmin_x norm(A*x-b)?
    x = R\(Q'*b)
  Why?
    proof 1: A = Q*R = [Q,Qhat]*[R;0] where [Q,Qhat] square, orthogonal,
             and compute || [Q,Qhat]^T*(A*x-b) ||_2
    proof 2: Ax - b = Q*R*x -b = Q*R*x - Q*Q'*b + Q*Q'*b - b 
                    = Q(R*x-Q'*b) + (Q*Q'-I)*b which are orthogonal 
    proof 3: plug into normal equations
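   A quick sketch checking this numerically, using Matlab's thin QR:
      [Q,R] = qr(A,0);              % Q is m x n with orthonormal columns, R is n x n
      x_qr  = R \ (Q'*b);
      x_ne  = (A'*A) \ (A'*b);      % normal equations, for comparison
      norm(x_qr - x_ne)/norm(x_qr)  % small when A is well conditioned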

  Algorithms for computing A=Q*R
    Classical and Modified Gram-Schmidt
      for i=1:n
        tmp    = A(:,i)
        for j=1:i-1
           r(j,i) = Q(:,j)'*A(:,i) ... CGS, costs 2m
           r(j,i) = Q(:,j)'*tmp    ... MGS, costs 2m
           tmp    = tmp    - r(j,i)*Q(:,j) ... costs 2m
        end
        r(i,i) = norm(tmp   ,2) ... costs 2m
        Q(:,i) = tmp   /r(i,i)  ... costs m
      end

    total #flops = 4m*n^2/2 +O(mn) = 2mn^2 + O(mn), about 2x normal equations
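    A runnable Matlab version of the MGS variant above (a sketch; the function name
    mgs is just for illustration):
       function [Q,R] = mgs(A)
       % Modified Gram-Schmidt QR of m x n A with m >= n: A = Q*R
         [m,n] = size(A);  Q = zeros(m,n);  R = zeros(n);
         for i = 1:n
           tmp = A(:,i);
           for j = 1:i-1
             R(j,i) = Q(:,j)'*tmp;           % MGS: project against the updated tmp
             tmp    = tmp - R(j,i)*Q(:,j);
           end
           R(i,i) = norm(tmp);
           Q(:,i) = tmp / R(i,i);
         end
       end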

    Two metrics of backward stability:
        If backward stable, we should get an accurate QR decomposition of a
          slightly perturbed input matrix: A+E = Q*R where norm(E)/norm(A) = O(macheps)
        This means norm(Q*R-A)/norm(A) should be O(macheps), 
          just like we expect norm(A-PLU)/norm(A) to be O(macheps)
        But it also means norm(Q'*Q-I) should be O(macheps), an extra condition.
    plots showing instability: 
          run QRStability with cnd = 1e1, 1e2, 1e4, 1e8, m=6, n=4, samples = 100
    Eventually: will explain stable implementation;
        Also building block for eigenvalue and SVD algorithms
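    A sketch of the kind of experiment QRStability runs (not the script itself),
    using the mgs sketch above:
        m = 6; n = 4; cnd = 1e8;
        [U,~] = qr(randn(m,n),0);  [V,~] = qr(randn(n));
        A = U*diag(logspace(0,-log10(cnd),n))*V';   % singular values from 1 down to 1/cnd
        [Q,R] = mgs(A);
        norm(A - Q*R)/norm(A)     % O(macheps): the residual is always small
        norm(Q'*Q - eye(n))       % grows with cond(A): orthogonality is lost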

Recall SVD = Singular Value Decomposition

   Thm. Suppose A is m x m. Then there exist an
           orthogonal matrix U = [u(1),...,u(m)]
           diagonal matrix Sigma = diag(sigma(1),...,sigma(m))
                  with sigma(1) >= sigma(2) >= ... >= 0
           orthogonal matrix V = [v(1),...,v(m)]
        with A = U*Sigma*V^T.
           u(i) called left singular vectors
           sigma(i) called singular values
           v(i) called right singular vectors
        More generally, if A is m x n with m > n, then
           U m x m and orthogonal as before
           V n x n and orthogonal
           Sigma is m x n with same diagonal as before
        When m > n, we sometimes write this as
           [u(1),...,u(n)] * diag(sigma(1),...,sigma(n)) * V^T
        ==  Uhat * Sigmahat * V^T
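        In Matlab, for example:
           [U,S,V]       = svd(A);          % full SVD: U m x m, S m x n, V n x n
           [Uhat,Shat,V] = svd(A,'econ');   % thin SVD when m > n: Uhat m x n, Shat n x n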

  Recall Fact 2 from Lecture 3:  
      When m>n, we can use it to solve the full rank least squares problem
      argmin_x ||A*x-b||_2 as follows: writing A = Uhat*Sigmahat*V' as above
      then x = V*inv(Sigmahat)*Uhat^T*b 

  Def: If A = Uhat*Sigmahat*V^T is full rank, its (Moore-Penrose) pseudoinverse is
           A^+ = V*inv(Sigmahat)*Uhat^T

  Facts: A^+ = inv(A^T*A)*A^T  (i.e. SVD and normal equations give same answer!)
         In matlab, A^+ = pinv(A)  (when A not full rank or #cols>#rows, def later)
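  A quick sketch checking this (for a full column rank A with m > n):
     [Uhat,Shat,V] = svd(A,'econ');
     Ap1 = V*(Shat\Uhat');                       % V*inv(Sigmahat)*Uhat^T
     Ap2 = (A'*A)\A';                            % inv(A^T*A)*A^T
     norm(Ap1 - pinv(A)), norm(Ap2 - pinv(A))    % both small; the second degrades if cond(A) is large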

  Perturbation theory for least squares problem:
  If x = argmin_x norm(A*x-b,2), and we change A and b a little, how much can x change?
  Since square case is special case, expect cond(A) to show up, but there is another
  source of sensitivity:
      A = [1;0], b = [0;b2],  => x = inv(A^T*A)*A^T*b = 1*0 = 0
      A = [1;0], b = [b1;b2], => x = 1*b1 = b1 
     If b2 very large, b1 could be small compared to b2 but still large, so big change in x
   More generally, if b (nearly) orthogonal to span(A), then condition number very large

    Expand e = (x+e) - x = inv((A+dA)^T*(A+dA))*(A+dA)^T*(b+db) - inv(A^T*A)*A^T*b,
        using approximation inv(I-X) = I + X + O(||X||^2) as before
    only keeping linear terms (one factor of dA and/or db), and calling
      eps = max(norm(dA)/norm(A) , norm(db)/norm(b)), we get
    norm(e)/norm(x) <= eps*(2*cond(A)/cos(theta) + tan(theta)*cond(A)^2) + O(eps^2)
     where theta = angle(b,A*x), or sin(theta) = norm(Ax-b,2)/norm(b,2)
    So condition number can be large if
      (1) cond(A) large
      (2) theta near pi/2, i.e. 1/cos(theta) ~ tan(theta) large
      (3) error like cond(A)^2 when theta not near zero

   Will turn out that using the right QR decomposition, or SVD, 
   keeps eps = O(machine epsilon), so backward stable

   Normal equations are not backward stable; since we do Chol(A^T*A) the
   error bound is always proportional to cond(A)^2, even when theta small.

Stable algorithms for QR decomposition:
   Recall that MGS and CGS produced Qs that were far from orthogonal, even though
   norm(A-Q*R) was as small as desired.  For solving least squares problems
   (and later eigenproblems and computing the SVD) we need algorithms that keep
   Q very nearly orthogonal, i.e. norm(Q^T*Q-I) = O(macheps)
   Note: MGS still has its uses in some other algorithms, CGS does not

   Basic idea: express Q as product of simple orthogonal matrices Q=Q(1)*Q(2)*...
   where each Q(i) accomplishes part of the task (multiplying it by A makes some
    entry zero, until A turns into R).  Since a product of orthogonal matrices 
   is orthogonal, there is no danger of losing orthogonality.

   There are two kinds of simple orthogonal matrices: Householder transformations 
   and Givens rotations. 

   A Householder transformation (or reflection) is of the form H = I-2*u*u^T, 
   where u is a unit vector. We confirm orthogonality as follows:
    H*H^T = (I-2*u*u^T)*(I-2*u*u^T) = I - 4*u*u^T + 4*u*u^T*u*u^T = I
   It is called a reflection because H*x is a reflection of x in the plane orthogonal to u
   (picture).
   Given x, we want to find u so that H*x has zeros in chosen locations; in particular
   we want u so that H*x = [c,0,0,...,0]^T = c*e1 for some constant c. 
   Since the 2-norm of both sides is ||x||_2, then |c| = ||x||_2. 
   We solve for u as follows: Write
      H*x = (I-2*u*u^T)*x = x - 2*u*(u^T*x) = c*e1, or u = (x-c*e1)/(2*u^T*x).
   The denominator is just a constant that we can choose so that ||u||_2 = 1,
   i.e. y = x +- ||x||_2*e1 and u = y/||y||_2. We write this as u = House(x).
   We choose the sign of +- to avoid cancellation (avoids some numerical problems)
     y = [ x(1) + sign(x(1))*||x||_2 , x(2), ..., x(n) ]
         
   (picture of two choices).
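   A minimal Matlab sketch of House(x) (the function name house is just for illustration):
      function u = house(x)
      % return a unit vector u with (I - 2*u*u')*x = -s*norm(x)*e1, s = sign(x(1))
        s = sign(x(1));  if s == 0, s = 1; end   % Matlab's sign(0) is 0
        y = x;
        y(1) = x(1) + s*norm(x);                 % add, never subtract, to avoid cancellation
        u = y / norm(y);
      end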

   How to do QR decomposition by forming Q as a product of Householder transformations
   (picture of 5 x 4 example)

   QR decomposition of m x n matrix A, m >= n
      for i = 1 to min(m-1,n) ... only need to do last column if m > n
         u(i) = House(A(i:m,i))   ... compute Householder vector
         A(i:m,i:n) = (I - 2*u(i)*u(i)^T)*A(i:m,i:n)
                    = A(i:m,i:n) - u(i)*(2*u(i)^T*A(i:m,i:n))

   Note: we never actually form each Householder matrix Q(i) = I-2*u(i)*u(i)^T,
   or multiply them to get Q, we only multiply by them.
   We only need to store the u(i), which are kept in A in place of the
   data they are used to zero out (just like the entries of L in Gaussian elimination).
   (picture)

   The cost is sum_{i=1 to n} 4*(m-i+1)*(n-i+1) = 2n^2m - (2/3)n^3 = (4/3)n^3 when m=n,
   twice the cost of GE
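   A runnable (unblocked) Matlab sketch of the loop above, using the house sketch
   from before; for clarity it stores the u(i) in a cell array instead of overwriting
   the zeroed-out entries of A as described in the note:
      function [U,R] = house_qr(A)
      % Householder QR of m x n A, m >= n: returns Householder vectors U{i} and R
        [m,n] = size(A);  U = cell(n,1);
        for i = 1:min(m-1,n)
          u = house(A(i:m,i));
          A(i:m,i:n) = A(i:m,i:n) - u*(2*(u'*A(i:m,i:n)));
          U{i} = u;
        end
        R = triu(A(1:n,1:n));
      end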

   So when we are done we get
     Q(n)*Q(n-1)*...*Q(1)*A = R or
     A = Q(1)^T * ... * Q(n)^T * R = Q(1)*...*Q(n)*R = Q*R
   and to solve the least squares problem argmin_x ||Ax-b||_2 we do
     x = R \ ( Q^T * b ) 
   or
     for i=1 to n
       c = -2*u(i)^T*b(i:m)       ... dot product
       b(i:m) = b(i:m) + c * u(i) ... "saxpy"
     endfor
     b = R \ b
   which costs O(mn), much less than A = QR (again analogous to GE)

   In Matlab, x = A\b does GEPP if A is square, and solves the least 
   squares problem using Householder QR when A has more rows than columns.

   Optimizing QR decomposition: We can (and should!) use all the ideas of Chapter 2
   to minimize how much data is moved by QR decomposition. We have only described
   the BLAS2 version (analogous to simplest version of LU). We will not do the
   details of how to reorganize the algorithm to do lots of matmuls, and reduce
   the amount of data moved to O(#flops / sqrt(memory_size)) (see Question 3.17),
   but this can be done.
      
   There is one other way to represent Q as a product of simple orthogonal
   transformations, called Givens rotations. They are useful in other
   cases than least squares problems (for which we use Householder) so we
   present them here.

   A Givens rotation R(theta) = [ cos(theta) -sin(theta) ]
                                [ sin(theta)  cos(theta) ]
   rotates x counterclockwise by theta (picture)
   More generally, we will take components i and j of a vector and rotate them:
   R(i,j,theta) = identity except for rows and columns i and j, containing entries of R(theta)
   (picture)
   To do QR, we want to create zero entries. So given x(i) and x(j), we want to pick theta
   so that R(i,j,theta)*x is 0 in entry j:
       [ cos(theta)  -sin(theta) ] * [ x(i) ] = [ sqrt(x(i)^2+x(j)^2) ]
       [ sin(theta)   cos(theta) ]   [ x(j) ] = [            0        ]
   or
       cos(theta) = x(i)/sqrt(x(i)^2+x(j)^2),  sin(theta) = -x(j)/sqrt(x(i)^2+x(j)^2)
   We do QR by repeatedly multiplying A by R(i,j,theta) matrices to introduce new zeros,
   until A becomes upper triangular (picture).
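   A small sketch of one Givens rotation (the helper name givens_cs is just for
   illustration; hypot avoids overflow in sqrt(x(i)^2+x(j)^2)):
      function [c,s] = givens_cs(xi, xj)
      % choose c,s so that [c -s; s c]*[xi; xj] = [r; 0],  r = sqrt(xi^2+xj^2)
        r = hypot(xi, xj);
        if r == 0, c = 1; s = 0; else, c = xi/r; s = -xj/r; end
      end
      % usage sketch: to zero out A(j,k) against A(i,k), rotate rows i and j:
      %   [c,s] = givens_cs(A(i,k), A(j,k));
      %   A([i,j],:) = [c -s; s c] * A([i,j],:);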

   Stability of applying Orthogonal matrices
     By using the basic rule fl(a op b) = (a op b)*(1+delta) , |delta| <= macheps
     one can show that for either multiplying by one Householder or Givens transformation Q
     one gets the following, where Q' is the floating point transformation actually represented
     in the machine, eg Q' = I - 2*u'*u'^T where u' is the computed Householder vector; as
     can be seen from the above formula, every component u'(i) is nearly equal to the exact 
     value u(i), but roundoff keeps it from being perfect.
        fl( Q'*A ) = Q'*A + E where norm(E) = O(macheps)*norm(A)
     This follows from our earlier analysis of dot products, etc. 
      Since Q' is nearly equal to the exact orthogonal Q, i.e. Q' = Q+F with norm(F) = O(macheps),
     we can also write this as
        fl( Q'*A ) =  (Q+F)*A + E  = Q*A + (F*A + E) = Q*A + G 
     where norm(G) <= norm(F)*norm(A) + norm(E) = O(macheps)*norm(A).
     i.e. you get an exact orthogonal transformation Q*A plus a small error.
     The same result applies to Givens rotations and block Householder transformations
     (used to reduce communication).

     We can also write this as fl(Q'*A) = Q(A + Q^T*G) = Q*(A+F), 
     where norm(F)=norm(G)=O(macheps)*norm(A),
     i.e. multiplying by Q is backward stable: we get the exact orthogonal transformation
     of a slightly different matrix A+F.

     Now multiply by a sequence of orthogonal matrices, as in QR decomposition:
        fl(Q3'*(Q2'*(Q1'*A))) = fl(Q3'*(Q2'*(Q1*A+E1)))
                              = fl(Q3'*(Q2*(Q1*A+E1)+E2))
                              = Q3*(Q2*(Q1*A+E1)+E2)+E3
                              = Q3*Q2*Q1*A + Q3*Q2*E1 + Q3*E2 + E3
                              = Q3*Q2*Q1*A + E
                              = Q*A + E
                              = Q*(A + Q^T*E)
                              = Q*(A+F)
         where ||F|| = ||E|| <= ||E1|| + ||E2|| + ||E3|| = O(macheps)*||A||
     so multiplying by many orthogonal matrices is also backwards stable.

     This analysis explains not just why QR is stable 
      (and so why norm(Q^T*Q-I) = O(macheps), unlike Gram-Schmidt) 
     but will explain why our eigenvalue algorithms are stable,
     since they (mostly) just multiply by orthogonal matrices.

Solving a Least Squares Problem when A is rank deficient
   How do we minimize ||A*x-b||_2 when A is rank deficient?
   Thm: Let A be m x n with m >= n, but rank r < n. 
        Let A = U*Sigma*V^T = [U1,U2,U3]*[Sigma1    0    ] * [V1,V2]^T
                                          [  0    Sigma2  ]
                                         [  0       0    ]
        be the SVD of A where 
           U  is m x m, U1 is m x r, U2 is m x n-r, U3 is m x m-n
           Sigma is m x n, Sigma1 is r x r, Sigma2 is n-r x n-r and exactly 0
           V  is n x n, V1 is n x r, V2 is n x n-r
        Then the set of vectors x minimizing ||A*x-b||_2 can be written as
          {x =  V1 * Sigma1^(-1) * U1^T * b + V2 * y, y any vector of dimension n-r}
        The unique vector minimizing ||A*x-b||_2 and ||x||_2 is
           x =  V1 * Sigma1^(-1) * U1^T * b ,  i.e. the choice y = 0

    Def: A^+ = V1 * Sigma1^(-1) * U1^T is the Moore-Penrose pseudo-inverse of the rank-r
         matrix A (includes the full rank case n=r).

     So for square or not, full rank or not, the "best" solution x minimizing
     ||A*x-b||_2 is x = A^+*b
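     In Matlab, pinv computes exactly this pseudoinverse, so a one-line sketch is
        x = pinv(A)*b;    % minimum norm least squares solution, any shape or rank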

    Proof: || A*x-b ||_2^2 =  || U*Sigma*V^T*x - b||_2^2
                           =  || Sigma*V^T*x - U^T*b ||_2^2
                           =  || [ Sigma1 * V1^T * x - U1^T * b ] ||_2^2
                              || [                   - U2^T * b ] ||
                              || [                   - U3^T * b ] ||
                           =  || Sigma1 * V1^T * x - U1^T * b ||_2^2 
                              + || U2^T * b ||_2^2 + || U3^T * b ||_2^2
           Now write x = V*V^T*x = [V1,V2]*[V1^T;V2^T]*x = [V1,V2]*[V1^T*x;V2^T*x] 
                                 = [V1,V2]*[y1;y2] = V1*y1 + V2*y2
           Note that V2*y2 is in the null space of A. Substituting for x, we get
            || A*x-b ||_2^2 = || Sigma1 * V1^T * (V1*y1 + V2*y2) - U1^T * b ||_2^2
                              + || U2^T * b ||_2^2 + || U3^T * b ||_2^2
                             = || Sigma1 * y1 - U1^T * b ||_2^2
                               + || U2^T * b ||_2^2 + || U3^T * b ||_2^2
                               (using V1^T*V1 = I and V1^T*V2 = 0)
           which is clearly minimized by y1 = Sigma1^(-1) * U1^T * b and any y2, i.e.
              x = V1*y1 + V2*y2 = V1 * Sigma1^(-1) * U1^T * b + V2*y2 for any y2.
           Since ||x||_2^2 = || [y1;y2] ||_2^2 = || y1 ||_2^2 + || y2 ||_2^2,
           ||x||_2^2 is minimized by the choice y2=0, as claimed. QED

Solving a Least Squares Problem when A is (nearly) rank deficient, with the SVD

  We have defined the condition number of a matrix as sigma_max / sigma_min,
  where sigma_min = 0 for a rank deficient matrix, so the condition number is
  formally infinite. To illustrate this, compare the (minimum norm) 
  least square solutions of
      argmin || [ 1 0 ] * [ x1 ] - [ 1 ] ||_2 ,  with solution x = [ 1 ]
             || [ 0 0 ]   [ x2 ]   [ 1 ] ||                        [ 0 ]
             || [ 0 0 ]            [ 1 ] ||
   and
      argmin || [ 1 0 ] * [ x1 ] - [ 1 ] ||_2 ,  with solution x = [ 1   ]
             || [ 0 e ]   [ x2 ]   [ 1 ] ||                        [ 1/e ]
             || [ 0 0 ]            [ 1 ] ||
   whose solutions differ arbitrarily much as e approaches 0.
  Given this sensitivity, it is natural to ask what sense it can
  make to solve a rank deficient least squares problem, when the
  solution can change discontinuously? In particular, we usually
  only know A to within some tolerance tol, i.e. the true A'
  could satisfy || A - A' || <= tol. When A is nearly singular,
  what should we do?

  Def: Given A and some tolerance tol bounding the uncertainty in A,
  the "truncated SVD" of A is defined as A = U * Sigma(tol) * V^T,
  where A = U*Sigma*V^T is the usual SVD, and
    Sigma(tol) = diag(sigma_i(tol)), where
       sigma_i(tol) = { sigma_i if sigma_i >= tol
                      {    0    if sigma_i < tol
  In other words, we "truncate" all the singular values smaller than 
  tol to zero. Or, we replace A by the lowest rank matrix within a
  distance tol (measured by the 2-norm) of A. 

  Using the truncated SVD effectively replaces the usual condition 
  number sigma_max(A)/sigma_min(A) by sigma_max(A)/tol, which can
  be much smaller, depending on the choice of tol. In other words,
  tol is a "knob" one can turn to trade off conditioning with how
  closely A*x can approximate b (raising tol, and so lowering
  the rank of A, decreases the dimension of the space that can be
  approximated by A*x). Replacing A by an "easier" matrix is also
  called "regularization" (using the truncated SVD is just one
  of several mechanisms).
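  A Matlab sketch of solving the least squares problem with the truncated SVD
  (tol is the user's knob):
     [U,S,V] = svd(A,'econ');
     sig = diag(S);
     k   = sum(sig >= tol);                          % number of singular values kept
     x   = V(:,1:k) * ((U(:,1:k)'*b) ./ sig(1:k));   % truncated SVD solution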

  Lemma: The difference between  
          x1 = argmin_x ||A'*x-b1||_2 and x2 = argmin_x ||A'*x-b2||_2
       (where we choose the solution of smallest norm in both cases)
       is bounded by ||b1-b2||_2 / tol.
       So choosing a larger tol limits the sensitivity of the solution.
    Proof: Let A = U*Sigma*V^T and A' = U*Sigma'*V^T as above. Then
           || x1-x2 ||_2 = || (A')^+(b1-b2) ||_2 = || V*(Sigma')^+*U^T*(b1-b2) ||_2
           = || diag(1/sigma(1) , 1/sigma(2) , ... , 1/sigma(k), 0 ... 0 ) *U^T*(b1-b2) ||_2
                 where sigma(k) is the smallest singular value >= tol
           <= (1/sigma(k)) * || U^T*(b1-b2) ||_2
           <= (1/tol) * || b1-b2 ||_2

Solving a Least Squares Problem when A is (nearly) rank deficient, with QR

   As before, the SVD makes everything easy, but it is expensive, several times
   as much as QR (depends on your computer), so a cheaper alternative to the
   truncated SVD is the QR decomposition with column pivoting:
   Suppose we did A = QR exactly, with A of rank r < n; what would R look like?
   If the leading r columns of A were full rank (true for "most" such A), then
     R = [ R11 R12 ] with R11 r x r and full rank, so R22 = 0.
         [  0  R22 ]
   If A nearly low-rank, we can hope for  || R22 || < tol, and set it to zero.
   Assuming that this works for a moment,
   write A = QR = [Q,Q']*[ R ] with [Q,Q'] = [Q1,Q2,Q'] square and orthogonal
                         [ 0 ]
   as before, with Q1 m x r, Q2 m x (n-r) and Q' m x (m-n)
   Thus argmin_x || Ax-b ||_2 = argmin_x || [Q1,Q2,Q']*[ R ] * x - b ||_2
                                                       [ 0 ]
                              = argmin_x || [ R11 R12 ] * [ x1 ] - [ Q1^T*b ] ||_2
                                            [  0   0  ]   [ x2 ]   [ Q2^T*b ]
                                            [  0   0  ]            [ Q'^T*b ]
         = argmin_x  || [ R11*x1 + R12*x2 - Q1^T*b ] ||_2
                        [                 - Q2^T*b ]
                        [                 - Q'^T*b ]
   with solution x1 = inv(R11)*(Q1^T*b - R12*x2) for any x2
  But how do we pick x2 to minimize || x ||_2?
  and is the solution as smooth as desired, i.e. could cond(R11) be >> 1/tol?

  Ex: A = .5*eye(n) - diag(ones(n-1,1),1) is already upper triangular,
       so A = QR = R. But cond(A) ~ 2^n (since (inv(A))(i,j) = 2^(j-i+1) on and above the diagonal),
      so at least one singular value is ~ 2^(-n), and should be set to zero when n is large.
      But if we order the columns of A to be Ap = A(:,[2:n,1]) and then do QR, we
      get Ap = QR with
          R22 = R(n,n) ~ 2^(-n)
         the other R(i,i) = O(1), and R11 = R(1:n-1,1:n-1) with cond(R11) = O(1),
         R12 = R(1:n-1,n) with norm(R12) = O(1)
      So choosing x2 = 0 and x1 = inv(R11)*Q1^T*b is a solution with small norm,
      as desired.
      (matlab demo)
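       A hedged sketch of such a demo (n = 30 is just an illustrative size):
          n = 30;
          A = .5*eye(n) - diag(ones(n-1,1),1);
          [Q1,R1]   = qr(A);       % no pivoting: |R1(i,i)| = 0.5 for all i
          [Q2,R2,P] = qr(A);       % column pivoting, A*P = Q2*R2
          min(abs(diag(R1))), min(abs(diag(R2)))   % the pivoted R2 should expose a tiny
                                                   % trailing entry, revealing near rank deficiency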

  How to choose permutation in general?
  Algorithm called column pivoting: at each step, look at the remaining columns
    not yet reduced to upper triangular form, and pick the one with the biggest norm
  Available in Matlab as [Q,R,P]=qr(A) (permutation matrix returned in P, so A*P = Q*R),
  which uses a LAPACK routine (dgeqp3.f)