Notes for Ma221 Lecture 16, Oct 19 2010

Begin Chapter 4 on eigenvalue problems

Goals: 
   Canonical Forms (recall Jordan, why we want Schur instead)
   Variations on eigenproblems (not always just one matrix!)
   Perturbation Theory (can I trust the answer?)
   Algorithms (for a single nonsymmetric matrix)

Recall basic definitions:

   Def: p(lambda) = det(A - lambda*I) is the characteristic polynomial,
   whose n roots are the eigenvalues of A
  
   Def: If lambda is an eigenvalue, a nonzero null vector x satisfying (A-lambda*I)*x = 0
   must exist, i.e. A*x = lambda*x, and is called a right eigenvector. Analogously,
   a nonzero vector y must exist such that y^* * A = lambda*y^*, and y is called
   a left eigenvector.

   Def: If S is nonsingular, and B = S*A*inv(S), then S is called a similarity transformation,
   and A and B are called similar matrices.

   Lemma: If A and B are similar, they have the same eigenvalues, and the eigenvectors
   are related by multiplying by S:
   Proof:  A*x = lambda*x iff S*A*inv(S)*S*x = S*lambda*x or B*(S*x) = lambda*(S*x)
           ie. iff lambda is also an eigenvalue of B and S*x is a right eigenvector of B
            Analogously, y^* * A = lambda * y^* iff y^* * inv(S)*S*A*inv(S) = lambda * y^* * inv(S)
           or (y^* * inv(S)) * B = lambda * ( y^* * inv(S)), i.e. iff y^* * inv(S) is
           a left eigenvector of B.

   Our goal will be to take A and transform it to a simpler similar form B, from which
   its eigenvalues and eigenvectors are easy to extract. The simplest form, from which 
   eigenvalues and eigenvectors are obvious, is a diagonal matrix D, since D*e(i) = D(i,i)*e(i),
   where e(i) is the i-th column of I.

   Lemma: Suppose A*x(i) = lambda(i)*x(i) for i=1 to n, and that the matrix S = [x(1),...,x(n)]
   is nonsingular, i.e. there are n linearly independent eigenvectors. Then
    A = S*diag(lambda(1),...,lambda(n))*inv(S). Conversely, if A = S*Lambda*inv(S) where
    Lambda is diagonal, then the columns of S are eigenvectors and the Lambda(i,i) are eigenvalues.
   Proof: A = S*Lambda*inv(S) if and only if A*S = S*Lambda if and only if
          the i-th columns of both sides are the same, i.e.  A*S(:,i) = S(:,i)*Lambda(i,i)
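    A minimal Matlab sketch (our own illustration, not from the text) checking this
    factorization numerically; eig returns S and Lambda directly when A is diagonalizable:
       A = randn(5);                     % a generic A is diagonalizable
       [S, Lambda] = eig(A);             % A*S = S*Lambda, columns of S are right eigenvectors
       norm(A - S*Lambda/S) / norm(A)    % should be O(macheps), up to cond(S)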

   But we can't always make B = S*A*inv(S) diagonal, for two reasons:
     It may be mathematically impossible (recall Jordan form, when there are multiple eigenvalues)
     It may be numerically unstable (even if the Jordan form is diagonal)

   Recall Jordan Form: for any A there exists a similar matrix J = S*A*inv(S) such that
    J = diag(J_1,...,J_k) where each J_i is a Jordan block:
          J_i = [ lambda   1   0   ...       0  ]
                [    0  lambda 1 0 ...       0  ]
                [        ...                    ]
                [    0   ...      lambda    1   ]
                [    0   ...         0   lambda ]
   Up to permuting the order of the J_i, the Jordan form is unique.
   Different J_i can have the same eigenvalue lambda (eg A = I).
    There is only one (right) eigenvector per J_i (namely [ 0, ..., 0, 1, 0, ..., 0 ]^T,
    with the 1 in the same row as the top row of J_i).
   So a matrix may have n eigenvectors (if there are n J_i; the matrix is called
   diagonalizable in this case) or fewer (in which case it is called defective).
   The number of times one value of lambda appears on the diagonal is called its multiplicity.

   But we will not compute the Jordan form, for numerical reasons (though algorithms do exist):
     Consider the following slightly perturbed 2x2 identity matrices: what are their
     eigenvalues and eigenvectors?
     (1)   [ 1 0   ]  :  ( 1, [ 1 ] ) and ( 1+e, [ 0 ] )         ... simplest case
           [ 0 1+e ]          [ 0 ]              [ 1 ]
     (2)   [ 1 e ]    :  ( 1+e, [ 1 ] ) and ( 1-e, [  1 ] )      ... rotated 45 degrees
           [ e 1 ]              [ 1 ]              [ -1 ]
     (3)   [ 1   e   ]  :  ( 1, [ 1 ] ) and ( 1+e^2, [ 1 ] )     ... nearly parallel
           [ 0 1+e^2 ]            [ 0 ]               [ e ]
     (4)   [ 1 e ]    :  ( 1, [ 1 ] ) only                       ... only one
           [ 0 1 ]            [ 0 ]
     (5)   [ 1 0 ]    :  ( 1, anything ) and ( 1, anything )     ... not unique
           [ 0 1 ]
    This means that when there are (nearly) multiple eigenvalues, the Jordan Form
    is very ill-conditioned (or may not exist, or may not be unique), which makes computing
    it fraught with numerical peril.

   The best we can generally hope for, as in earlier chapters, is backwards stability:
   Getting exactly the right eigenvalues and similarity S for a slightly perturbed
   input matrix A + E, where || E || = O(macheps)*|| A || 

   In the last chapter we said that as long as you multiply a matrix by orthogonal
   matrices, it is backward stable, i.e. 
      fl( Q(k)*Q(k-1)*...*Q(1)*A ) = Q*(A+E)  where Q is exactly orthogonal, and 
                || E || = O(macheps)*|| A ||
   If we apply this to computing an orthogonal similarity transformation, we get
    fl ( Q(k)*Q(k-1)*...*Q(1) * A * Q(1)^T*...*Q(k-1)^T*Q(k)^T ) = Q*(A+E)*Q^T
   i.e. the exact orthogonal similarity of the slightly perturbed input A+E.

   This means that if we can restrict the similarity transforms S we use to
   try to make S*A*inv(S) simple to be orthogonal, we get backwards stability.
   So the question is: if we restrict S to be orthogonal, how close to Jordan form 
   can we get? 

   Theorem (Schur Canonical Form): Given any square A there is a unitary Q such that 
   Q^* * A * Q = T is upper triangular. The eigenvalues can be made to appear on the diagonal 
   of T in any order.

   Note that eigenvectors are easy to compute from the Schur form if you need them: 
   T*x = T(i,i)*x turns into solving a triangular system:
      [ T11 T12    T13 ] * [ x1 ] = T(i,i) * [ x1 ] means T11*x1 + T12*x2 + T13*x3 = T(i,i)*x1
      [  0  T(i,i) T23 ]   [ x2 ]            [ x2 ]             T(i,i)*x2 + T23*x3 = T(i,i)*x2
      [  0  0      T33 ]   [ x3 ]            [ x3 ]                         T33*x3 = T(i,i)*x3
   If there is only one copy of eigenvalue T(i,i), then the only solution of 
    T33*x3 = T(i,i)*x3 is x3 = 0. Then T(i,i)*x2 = T(i,i)*x2 is satisfied by any x2; we pick x2 = 1.
   Finally we solve (T11-T(i,i)*I)*x1 = -T12, a triangular system for x1.
   If T(i,i) is a multiple eigenvalue, then T11 - T(i,i)*I might be singular, so we might
   not be able to solve (T11-T(i,i)*I)*x1 = -T12, as expected.
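   Here is a minimal Matlab sketch of this triangular solve (our own helper, called
   schur_eigvec just for illustration), assuming T(i,i) is a simple eigenvalue:
      function x = schur_eigvec(T, i)
        % eigenvector of upper triangular T for the simple eigenvalue T(i,i)
        n = size(T,1);
        x = zeros(n,1);
        x(i) = 1;                          % this is "x2 = 1"; x(i+1:n) stays 0 ("x3 = 0")
        if i > 1                           % solve (T11 - T(i,i)*I)*x1 = -T12
          x(1:i-1) = -(T(1:i-1,1:i-1) - T(i,i)*eye(i-1)) \ T(1:i-1,i);
        end
      end
   If Q^* * A * Q = T, then Q*x is the corresponding eigenvector of A.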

   Proof of Theorem: We use induction: Let x be a right eigenvector with ||x||_2 = 1, and
   Let Q = [x,Q'] be any unitary matrix with x as its first column. Then
     Q^* * A * Q = [ x^*  ] * A * [ x, Q' ] = [ x^* * A * x   x^* * A * Q' ]
                   [ Q'^* ]                   [Q'^* * A * x   Q'^* * A * Q']
                  = [ lambda * x^* * x    x^* * A * Q'  ] = [ lambda   x^* * A * Q'  ]
                    [ lambda * Q'^* * x   Q'^* * A * Q' ]   [   0      Q'^* * A * Q' ]
    Now we apply induction to the smaller matrix Q'^* * A * Q' to write it as U^* * T * U,
    where T is upper triangular and U is unitary, so
    Q^* * A * Q = [ lambda   x^* * A * Q' ] = [ 1  0 ] * [ lambda  x^* * A * Q' * U^* ] * [ 1 0 ]
                  [   0      U^* * T * U  ]   [ 0 U^*]   [   0             T          ]   [ 0 U ]
    or   [ 1 0 ] * Q^* * A * Q * [ 1 0   ] = [ lambda  stuff ] as desired
         [ 0 U ]                 [ 0 U^* ]   [    0      T   ]

   But there is still an obstacle: Real matrices can have complex eigenvalues (unless, say, they 
   are symmetric, the topic of Chap 5). So T may have to be complex even if A is real; we'd
   prefer to keep arithmetic real if possible, for various reasons (reduce #flops, less memory,
   make sure complex eigenvalues come in conjugate pairs despite roundoff).
   So instead of triangular T, we will settle for block triangular T:
    T = [ T11 T12 ... T1k ]
        [  0  T22 ... T2k ]
        [      ...        ]
        [  0   0  ... Tkk ]
   The eigenvalues of T are just the union of the eigenvalues of all the Tii (Q 4.1).
   We will show that we can reduce any real A to such a block triangular T where
   each Tii is either 1x1 (so a real eigenvalue) or 2x2 (with two complex conjugate eigenvalues).

   Theorem (Real Schur Canonical Form): Given any real square A, there is a real orthogonal Q
   such that Q*A*Q^T is block upper triangular with 1x1 and 2x2 blocks.

   To prove this we need to generalize the notion of eigenvector to "invariant subspace":

   Def: Let V = span{x1,...,xm} = span(X) be a subspace of R^n. It is called an invariant subspace
   if A*V = span(A*X) is a subset of V.

   Ex:  V = span{x} = {alpha*x for any scalar alpha}  where A*x = lambda*x, then 
        A*V = {A*(alpha*x) for any alpha} = {alpha*lambda*x for any alpha} is in span(x)=V
        (when would A*V not equal V?)
   Ex:  V = span{x(1),...,x(k)} ={sum_{i=1 to k} alpha(i)*x(i) for any scalars alpha(i)}
         where K*x(i) is replaced by A*x(i) = lambda(i)*x(i) for i = 1 to k, then
        A*V = {A*sum_{i=1 to k} alpha(i)*x(i) for any alpha(i) }
            = {sum_{i=1 to k} A*alpha(i)*x(i) for any alpha(i) } 
            = {sum_{i=1 to k} alpha(i)*lambda(i)*x(i) for any alpha(i) }
        is a subset of V (when would A*V not equal V?)

  Lemma: If V = span{x(1),...,x(m)} = span{X} is an m-dimensional invariant subspace of A
         (i.e. x(1),...,x(m) are independent) then there is an mxm matrix B such that
         A*X = X*B. The eigenvalues of B are also eigenvalues of A.
  Proof: The existence of B follows from the definition of invariant subspace:
         A*x(i) in V means there are scalars B(1,i),...,B(m,i) (the i-th column of B) such that
         A*x(i) = sum_{j=1 to m} x(j)*B(j,i). If B*y = lambda*y, then
         A*X*y = X*B*y = X*y*lambda, so X*y is an eigenvector of A with eigenvalue lambda.

  Lemma: Let V = span{X} be an m-dimensional invariant subspace of A as above, with A*X=X*B.
         Let X = Q*R, and let [Q,Q'] be square and orthogonal. Then
           [Q,Q']^T * A * [Q,Q'] = [ A11  A12 ] is block upper triangular
                                   [  0   A22 ]
         with A11 = R * B * inv(R) having the same eigenvalues as B.
  Proof: [Q,Q']^T * A * [Q,Q'] = [ Q^T * A * Q   Q^T * A * Q' ] = [ A11 A12 ]
                                 [ Q'^T* A * Q   Q'^T* A * Q' ]   [ A21 A22 ]
         where A * Q = A * X * inv(R) = X * B * inv(R) = Q * R * B * inv(R)
         so  A11 = Q^T * Q * R * B * inv(R) = R * B * inv(R)
         and A21 = Q'^T * Q * R * B * inv(R) = 0

  Proof of Real Schur Form: We use induction as before. If A*x = lambda*x where
         lambda and x are real, we reduce to a smaller problem as before.
         If lambda and x are complex, write the real and imaginary parts of 
         A*x = lambda*x
         separately as  [A*Re(x),A*Im(x)] = A*[Re(x),Im(x)]
                                          = [ Re(lambda)*Re(x) - Im(lambda)*Im(x),
                                              Re(lambda)*Im(x) + Im(lambda)*Re(x) ]
                       = [ Re(x), Im(x) ] * [ Re(lambda)  Im(lambda) ]
                                            [-Im(lambda)  Re(lambda) ]
          or A*X = X*B where X = [ Re(x), Im(x) ] and B = [ Re(lambda) Im(lambda) ]
                                                          [-Im(lambda) Re(lambda) ]
          We may confirm that the two eigenvalues of B are lambda and conj(lambda),
          i.e. the equation A*X = X*B represents the information that both (lambda,x)
          and (conj(lambda),conj(x)) are "eigenpairs".
          So by the Lemma we can do an orthogonal similarity on A to get [ A11 A12 ]
                                                                         [  0  A22 ]
          where the eigenvalues of A11 are lambda and conj(lambda), completing the induction.

Typical sources of eigenvalue problems: ODEs

   y'(t) = K*y(t)  =>  try y(t) = exp(lambda*t)*y0, getting
    lambda*exp(lambda*t)*y0 = K*exp(lambda*t)*y0 or
   K*y0 = lambda*y0, so that y0 is an eigenvector and lambda an eigenvalue

   If we are given y(0), and can write it as a linear combination of eigenvectors
    y(0) = sum_i alpha(i)*x(i) where K*x(i) = lambda(i)*x(i), then
   let y(t) = sum_i alpha(i)*exp(lambda(i)*t)*x(i), and confirm that
       y(0) = sum_i alpha(i)*1*x(i) 
   and y'(t) = sum_i alpha(i)*exp(lambda(i)*t)*lambda(i)*x(i)
             = sum_i alpha(i)*exp(lambda(i)*t)*K*x(i)
             = K*y(t) as desired.
   Another way to write this is that 
       y(t) = exp(t*K)*y(0)
            = sum_{i=0 to inf} (t*K)^i/i! * y(0)
            = sum_{i=0 to inf} (t*S*Lambda*S^(-1))^i/i! * y(0)   ... if K = S*Lambda*S^(-1) is diagonalizable
            = sum_{i=0 to inf} S*(t*Lambda)^i*S^(-1)/i! * y(0)   
            = S * ( sum_{i=0 to inf} (t*Lambda)^i /i! ) * S^(-1) * y(0) 
            = S * diag ( sum_{i=0 to inf} (t*lambda(j))^i /i! ) * S^(-1) * y(0) 
            = S * diag ( exp(t*lambda(j)) ) * S^(-1) * y(0) 
    This is the same as above, where S = [x(1),...,x(n)] and S^(-1)*y(0) = [alpha(1),...,alpha(n)]
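    A quick Matlab check of the last formula (our own example, building a diagonalizable K
    from a random S):
       n = 4; S = randn(n); lam = randn(n,1);
       K = S*diag(lam)/S;                  % K = S*Lambda*S^(-1)
       y0 = randn(n,1); t = 0.7;
       y1 = expm(t*K)*y0;                  % y(t) = exp(t*K)*y(0)
       y2 = S*diag(exp(t*lam))*(S\y0);     % S*diag(exp(t*lambda(j)))*S^(-1)*y(0)
       norm(y1 - y2)/norm(y1)              % small, up to cond(S) and roundoff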
  
   One can also use the Schur Form (Question 4.4)
   When K does not have n independent eigenvectors, the general solution uses the Jordan form
   (Section 4.5.1 in text).

   Now consider M*y''(t) = -K*y(t),  where M is usually called the mass matrix and K the stiffness
   matrix, (or M could be inductances in a circuit, and K reciprocals of capacitances).
   Proceeding as before with y(t) = exp(lambda*t)*y0 we get
    lambda^2*M*y0 = -K*y0, or (lambda^2*M + K)*y0 = 0. We call -lambda^2 a "generalized eigenvalue"
    and y0 a "generalized eigenvector" of the pair of matrices K and M. The characteristic
   polynomial det(K - z*M) now depends on 2 matrices. There is a generalization of
   the Jordan form to pairs of matrices, called Weierstrass form, and of Schur form, to
   deal with this.

   Now consider M*y''(t) = -B*y'(t) - K*y(t), where B is the damping matrix (or resistances in
   a circuit). Again trying y(t) = exp(lambda*t)*y0 we get
    (lambda^2*M + lambda*B + K)*y0 = 0, which yields a "nonlinear eigenvalue" problem.
    We may convert it to a linear one by rewriting it as
       [ M 0 ] * [ y'(t) ]' = [ -B -K ] * [ y'(t) ]    or    [ M 0 ] * z'(t) = [ -B -K ] * z(t)
       [ 0 I ]   [ y(t)  ]    [  I  0 ]   [ y(t)  ]          [ 0 I ]           [  I  0 ]
    Plugging in z(t) = exp(lambda*t)*z0, we proceed as before.

   More generally, all the ideas of this chapter (eigenvalues, eigenvectors, Jordan form,
   Schur form) extend to more general eigenvalue problems with 2 or more matrices,
    that do not even have to be square ("rectangular" eigenproblems arise naturally in
   systems and control problems); we will sketch these extensions later.

Perturbation Theory: How can I trust my answer?

   Recall that the best we can hope for is backward stability: right answer (eigenvalues)
   for a slightly wrong problem A + E, where ||E|| = O(macheps)*||A||. How much can
   this change the eigenvalues and vectors?

    Recall: we showed that if eigenvalues are close together, eigenvectors can be very sensitive
    (or not exist, or not be unique, as for I). How about the eigenvalues?

   To describe perturbations we consider 
   Def: The epsilon pseudo-spectrum of A is the set of all eigenvalues of all matrices
     within distance epsilon of A: 
    Lambda_eps(A) = {lambda: (A+E)x = lambda*x for some nonzero x and some ||E||_2 <= eps}

  Ideal case: Lambda_eps(A) = union of disks of radius eps centered at eigenvalues of A
    Will show this is true for A = A^* (Chapter 5).
   Worst case: Thm (Trefethen & Reichel): Given any simply connected region R in the 
   complex plane, any point x in R, and any eps > 0, there is an A with one eigenvalue at x
   such that Lambda_eps(A) comes as close to filling out R as you like. (picture)

  Example: Perturb the nxn Jordan block with eigenvalue 0 by changing J(n,1) to eps; the
   eigenvalues move onto a circle of radius eps^(1/n), which is >> eps. This example shows
  (1) that eigenvalues are not necessarily differentiable functions of the matrix entries 
      (the slope of eps^(1/n) is infinite at eps = 0), although they are continuous 
      (and differentiable when not multiple), and
  (2) that we should expect a sensitive eigenvalue when it is (close to) multiple,
      as was the case for eigenvectors.
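  Here is a Matlab sketch of the perturbed Jordan block example (our own demo):
     n = 8; e = 1e-8;                   % e^(1/n) = 0.1
     J = diag(ones(n-1,1), 1);          % n x n Jordan block with eigenvalue 0
     J(n,1) = e;                        % perturb the (n,1) entry by e
     sort(abs(eig(J)))                  % all n eigenvalues have magnitude near 0.1 >> e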

  Let us find the condition number of a simple (nonmultiple) eigenvalue:
  Thm: Let lambda be a simple eigenvalue, with A x = lambda x and y^* A = lambda y^*,
  and ||x||_2 = ||y||_2 = 1.  
   If we perturb A to A + E, then lambda is perturbed to lambda + dlambda, where
      dlambda  = (y^* E x) / (y^* x) + O(||E||^2)
     |dlambda| <= ||E|| / |y^* x| + O(||E||^2) = sec(theta) * ||E|| + O(||E||^2)
   where theta is the acute angle between x and y. So sec(theta) is the condition number of lambda.
  Proof: Subtract A x = lambda x from (A+E)(x+dx) = (lambda + dlambda)(x+dx) to get
          A dx + E x + E dx = lambda dx + dlambda x + dlambda dx
  Ignore second order terms E dx and dlambda dx, and multiply by y* to get
        y^* A dx + y^* E x = lambda y^* dx + dlambda y^* x
  Cancel y^*A dx = lambda y^* dx to get
        y^* E x = dlambda y^* x 
  as desired.
  Note that a Jordan block has x = e(1) and y = e(n), so y^* x = 0 as expected.
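   In Matlab, condeig computes these condition numbers (a quick sketch, our own example):
      A = randn(6);
      [V, D, s] = condeig(A);     % s(i) = 1/|y_i^* x_i| = sec(theta_i) for eigenvalue D(i,i)
      [diag(D), s]                % eigenvalues next to their condition numbers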

   An important special case is that of real symmetric matrices (or more generally normal matrices,
  where A^* A = A A^*), since these have orthogonal eigenvectors:
  Corollary: If A is normal, perturbing A to A+E means | dlambda | <= ||E|| + O(||E||^2)
  Proof: A = Q Lambda Q^* is the eigendecomposition, where Q is unitary, so
   A*Q = Q*Lambda, and the right eigenvectors are the columns of Q, and
  Q^* A = Lambda Q^*, and the left eigenvectors are also the columns of Q, so x = y.

  Later, in Chapter 5, for real symmetric matrices A=A^T (or more generally complex
  Hermitian matrices A = A^*), we will show that if E is also symmetric, then
  | dlambda | <= ||E|| no matter how big ||E|| is

  The above theorem is true for small ||E||. It is possible to change it slightly to
  work for large E:

  Thm (Bauer-Fike): Let A have all simple eigenvalues (i.e. be diagonalizable).
  Call them lambda_i with right and left eigenvectors x_i and y_i, normalized so
   ||x_i||_2 = ||y_i||_2 = 1. Then for any E the eigenvalues of A+E lie in the union of disks 
   D_i in the complex plane, where D_i has center lambda_i and radius n*||E||_2 / |y_i^* x_i|.

  Note that this is just n times larger than the last theorem. Also note that if two disks
  D_i and D_j overlap, all the theorem guarantees is that there are two eigenvalues of A+E
  in the union D_i U D_j (the same idea applies if more disks overlap). (see book for proof).
  
Algorithms for the Nonsymmetric Eigenproblem

  Our ultimate algorithm, the Hessenberg QR algorithm, takes a nonsymmetric A
  and computes the Schur form A = Q T Q^* in O(n^3) floating point operations.
  We will build up to it with simpler algorithms that will also prove useful 
  as building blocks for the algorithms for sparse matrices, where we only
  want to compute a few eigenvalues and vectors. The Hessenberg QR algorithm
  will also be used as a building block for large sparse matrices, because
  our algorithms for them will approximate them (in a certain sense) by
  much smaller dense matrices, to which we will apply Hessenberg QR.

  The plan is as follows:
    Power Method: Just repeated multiplication of a vector by A; we'll show
      this makes the vector converge to the eigenvector for the eigenvalue
      of largest absolute value, which we also compute
    Inverse Iteration: Apply power method to B = (A - sigma*I)^(-1), which has
      the same eigenvectors as A, but now the largest eigenvalue in absolute
      value of B corresponds to the eigenvalue of A closest to sigma (which is 
      called the "shift").  By choosing sigma appropriately, this lets us get 
      any eigenvalue of A, not just the largest.
    Orthogonal Iteration: This extends the power method from one vector to
      compute a whole invariant subspace
    QR iteration: we combine Inverse Iteration and Orthogonal Iteration to
      get our ultimate algorithm.

   There are a lot of other techniques needed to make QR iteration efficient
   (run in O(n^3)) and reliable, as well as to reduce data movement. We will
   only discuss some of these.

  Power Method: given x(0), we iterate
    i=0
    repeat
       y(i+1) = A*x(i)
       x(i+1) = y(i+1) / ||y(i+1)||_2        ... approximate eigenvector
        lambda'(i+1) = x(i+1)^T * A * x(i+1)  ... approximate eigenvalue
       i = i+1
    until convergence
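     Here is a runnable Matlab version (our own helper, power_method, for illustration):
        function [lambda, x] = power_method(A, x0, maxit)
          % power method: x converges to the eigenvector of the eigenvalue of
          % largest absolute value, when that eigenvalue is unique in magnitude
          x = x0 / norm(x0);
          for i = 1:maxit
            y = A*x;
            x = y / norm(y);            % approximate eigenvector
            lambda = x' * A * x;        % approximate eigenvalue (Rayleigh quotient)
          end
        end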

    We first analyze convergence when A = diag(lambda(1),...,lambda(n)) 
    where |lambda(1)| > |lambda(2)| >= |lambda(3)| >= ...
    and then generalize:

    We note that x(i) = A^i * x(0) / || A^i * x(0) ||_2
    and that A^i * x(0) = [ lambda(1)^i * x(0)_1 , lambda(2)^i * x(0)_2 , ... ]
                        = lambda(1)^i [ x(0)_1 , (lambda(2)/lambda(1))^i * x(0)_2 , ... ]

      so x(i) = [ x(0)_1 , (lambda(2)/lambda(1))^i * x(0)_2 , ...] / || " ||_2
     As i grows, each (lambda(j)/lambda(1))^i converges to 0, and x(i) converges to  [1,0,...,0]
     as desired, with error O(|lambda(2)/lambda(1)|^i)

     More generally, suppose A is diagonalizable, with A = S Lambda S^(-1), so
     A^i = S Lambda^i S^(-1). Let z = S^(-1) * x(0), so
        A^i * x(0) = S * [lambda(1)^i * z_1 , lambda(2)^i * z_2 , ... ]
                  = lambda(1)^i [ z_1 * S(:,1) + (lambda(2)/lambda(1))^i * z_2 * S(:,2) + ...]
     As i increases, the vector in [ ] converges to z_1 * S(:,1), i.e. a multiple of the
      first column of S, which (since A*S = S*Lambda) is the eigenvector of A for lambda(1), as desired.

     For this to converge to the desired eigenvector at a reasonable rate, we need
      (1)  |lambda(2)/lambda(1)| < 1, and the smaller the better. This is not necessarily
           true; for example, for an orthogonal matrix (A*A^T = I) all the eigenvalues have
           absolute value 1 (since ||x|| = ||A*x|| = ||lambda*x|| = |lambda|*||x||), so there is no convergence.
     (2)  z_1 nonzero, and the larger the better. If we pick x(0) at random, it is very unlikely
          that z_1 will be very tiny, but there are no guarantees.

     To deal with needing |lambda(1)| >> |lambda(2)| to get convergence, we use
     inverse iteration, i.e. the power method on B = (A - sigma*I)^(-1), where sigma
     is called the shift. 

   Inverse iteration: given x(0), we iterate
    i=0
    repeat
       y(i+1) = (A-sigma*I)^(-1)*x(i)
       x(i+1) = y(i+1) / ||y(i+1)||_2        ... approximate eigenvector
        lambda'(i+1) = x(i+1)^T * A * x(i+1)  ... approximate eigenvalue
       i = i+1
    until convergence
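     Here is a runnable Matlab version (our own helper, inverse_iteration); note that we
     factor A - sigma*I once, so each iteration after the first costs only O(n^2):
        function [lambda, x] = inverse_iteration(A, sigma, x0, maxit)
          n = size(A,1);
          [L, U, P] = lu(A - sigma*eye(n));   % factor once, reuse every iteration
          x = x0 / norm(x0);
          for i = 1:maxit
            y = U \ (L \ (P*x));              % y = (A - sigma*I)^(-1) * x
            x = y / norm(y);                  % approximate eigenvector
            lambda = x' * A * x;              % approximate eigenvalue
          end
        end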


     The eigenvectors of B are the same as those of A, but its
      eigenvalues are 1/(lambda(i) - sigma). Suppose sigma is closer
      to lambda(k) than to any other eigenvalue of A. Then the same
      kind of analysis as above shows that x(i) is obtained by taking
      the following vector (multiplied by S when A = S*Lambda*S^(-1) is not diagonal)
      and dividing by its norm:
          [ ((lambda(k) - sigma)/(lambda(1) - sigma))^i * z(1)/z(k) ]
          [ ((lambda(k) - sigma)/(lambda(2) - sigma))^i * z(2)/z(k) ]
          [                         ...                             ]
          [                          1                              ]  ... k-th component
          [                         ...                             ]
          [ ((lambda(k) - sigma)/(lambda(n) - sigma))^i * z(n)/z(k) ]
     So if we can choose sigma much closer to lambda(k) than any other lambda(i),
     we can make convergence as fast as we want. Where do we get sigma?
     Once we start converging, the algorithm itself computes an improving
     estimate lambda'(i) of the eigenvalue; we will see later that 
     this makes convergence very fast, quadratic or even cubic in some cases.

   The next step is to compute more than one vector at a time.
   We do this first for the analogue of the power method:

    Orthogonal Iteration: given Z(0), an n x p matrix with orthonormal columns, we iterate
    i = 0
    repeat
       Y(i+1) = A*Z(i)
       factor Y(i+1) = Z(i+1)*R(i+1)  ... QR decomposition
              ... Z(i+1) spans an approximate invariant subspace
       i=i+1
    until convergence
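    Here is a runnable Matlab version (our own helper, orthogonal_iteration):
       function Z = orthogonal_iteration(A, Z, maxit)
         % Z is n x p with orthonormal columns; span(Z) converges to the dominant
         % p-dimensional invariant subspace of A under the assumptions analyzed below
         for i = 1:maxit
           Y = A*Z;
           [Z, R] = qr(Y, 0);     % thin QR decomposition
         end
       end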

    Here is an informal analysis, assuming  A = S*Lambda*S^(-1) is diagonalizable and
      | lambda(1) | >= | lambda(2) | >= ... >= | lambda(p) | > | lambda(p+1) | >= ...
    i.e. the first p eigenvalues are larger in absolute value than the others.
    Note that 
       span{Z(i+1)} = span{Y(i+1)} = span{A*Z(i)} = ... 
     = span{A^i*Z(0)} by induction
     = span(S*Lambda^i*S^(-1)*Z(0))
   Now
       Lambda^i * S^(-1) * Z(0)
     = lambda(p)^i * 
       diag((lambda(1)/lambda(p))^i , ... , 1 , (lambda(p+1)/lambda(p))^i , ...) * S^(-1) * Z(0)
     =  lambda(p)^i [ V(i) ]    p x p
                    [ W(i) ]  n-p x p
    where (lambda(j)/lambda(p))^i does not go to 0 for j < p (it grows or stays bounded away from 0)
    and goes to 0 for j > p. Thus W(i) goes to zero, and V(i) does not. If V(0) has full rank, so will V(i). Thus
   A^i*Z(0) = lambda(p)^i * S * [ V(i) ] approaches lambda(p)^i * S * [ V(i) ]
                                [ W(i) ]                              [  0   ]
   so it is a linear combination of the first p columns of S, i.e. the first p eigenvectors,
   the desired invariant subspace.

   Note that the first k < p columns of Z(i) are the same as though we had run the
   algorithm starting with the first k columns of Z(0), because the k-th column of
   Q and R in the QR decomposition of A=QR only depend on columns 1:k of A.
   In other words, Orthogonal Iteration runs p different iterations simultaneously, 
   with the first k columns of Z(i) converging to the invariant subspace spanned by 
   the first k eigenvectors of A.

   Thus we can let p = n and Z(0) = I, and try to compute n invariant subspaces
   at the same time. This will give us Schur Form:
  
   Theorem: Run Orthogonal iteration on A with Z(0) = I. If all the eigenvalues of
   A have different absolute values, and if the principal submatrices S(1:k,1:k) of
   the matrix of eigenvectors of A all have full rank, then A(i) = Z(i)^T * A * Z(i)
    converges to Schur form, i.e. upper triangular with the eigenvalues on the diagonal.

   In Matlab (demo), try 
      n=6, D = diag(.5.^[1:n]), S = randn(n,n), A = S*D*inv(S), Z = eye(n);
   and repeat
     Y = A*Z; [Z,R]= qr(Y); Z'*A*Z

    Proof: By the previous analysis, for each k, the span of the first k columns of Z(i) 
     converges to the invariant subspace spanned by the first k eigenvectors of A.
    Write Z(i) = [ Z(i)_1 , Z(i)_2 ] where Z(i)_1 has k columns so
      Z(i)^* * A * Z(i) = [ Z(i)_1^* * A * Z(i)_1  Z(i)_1^* * A * Z(i)_2 ]
                          [ Z(i)_2^* * A * Z(i)_1  Z(i)_2^* * A * Z(i)_2 ]
    Now A * Z(i)_1 converges to Z(i)_1 * B(i) since Z(i)_1 is converging to an invariant subspace,
    so Z(i)_2^* * A * Z(i)_1 converges to Z(i)_2^* * Z(i)_1 * B(i) = 0. 
    Since this is true for every k, Z(i)^* * A * Z(i) converges to upper triangular form T(i).
    Since Z(i) is unitary, this is the Schur form.
   
 The next step is to rewrite Orthogonal Iteration as QR iteration, which
 will let us incorporate inverse iteration, and so converge rapidly to
 any eigenvalue for which we have a good approximation (which will be supplied
 by the algorithm!).

 QR Iteration: Given A(0) = A we iterate
   i = 0
   repeat
     factor A(i) = Q(i)*R(i)   ... QR decomposition
     A(i+1) = R(i)*Q(i)
     i = i + 1
   until convergence

  Note that A(i+1) = R(i)*Q(i) = Q(i)^T * Q(i) * R(i) * Q(i) = Q(i)^T * A(i) * Q(i)
  so that all the A(i) are orthogonally similar.

  Thm: A(i) from QR iteration is identical to Z(i)^T*A*Z(i) from Orthogonal iteration,
  starting with Z(0) = I. Thus A(i) converges to Schur Form if all the eigenvalues
  have different absolute values.

  Proof: We use induction: assume A(i) = Z(i)^T*A*Z(i). Then taking one step of
   Orthogonal iteration we write A*Z(i) = Z(i+1)*R(i+1), the QR decomposition.
   Then Z(i)^T*A*Z(i) = Z(i)^T*Z(i+1)*R(i+1) = orthogonal * upper triangular,
   so this must also be the QR decomposition of A(i) (by uniqueness). Then
     Z(i+1)^T * A * Z(i+1) = Z(i+1)^T * A * (Z(i)*Z(i)^T) * Z(i+1)
                           = (Z(i+1)^T * A * Z(i) ) * ( Z(i)^T * Z(i+1) ) 
                           = ( R(i+1) ) * (Z(i)^T * Z(i+1))
                           = R * Q, 
                           = A(i+1)
     where Q*R =  Z(i)^T*A*Z(i), i.e. we have taken one step of QR iteration.
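   A quick Matlab check of this equivalence (our own demo; the QR factorizations are only
   unique up to column signs, so we compare entrywise absolute values):
      n = 5; A = randn(n); Z = eye(n); Aq = A;
      for i = 1:10
        Y = A*Z; [Z, R] = qr(Y);        % Orthogonal Iteration
        [Q, R] = qr(Aq); Aq = R*Q;      % QR Iteration
      end
      norm(abs(Z'*A*Z) - abs(Aq))       % should be tiny (roundoff only)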

  Now we show how to incorporate inverse iteration:

  QR iteration with a shift: Given A(0) = A, we iterate
    i = 0 
    repeat
      choose a shift sigma(i) near an eigenvalue of A
       factor A(i) - sigma(i)*I = Q(i)*R(i)   ... QR decomposition
      A(i+1) = R(i)*Q(i) + sigma(i)*I
      i = i+1
    until convergence

  Lemma: A(i) and A(i+1) are orthogonally similar.
  Proof: A(i+1) = R(i)*Q(i) + sigma(i)*I 
                = Q(i)^T*Q(i)*R(i)*Q(i) + sigma(i)*I
                = Q(i)^T*(A(i)-sigma(i)*I)*Q(i) + sigma(i)*I
                = Q(i)^T*A(i)*Q(i)

  If R(i) is nonsingular, we can also write
         A(i+1) = R(i)*Q(i) + sigma(i)*I 
                = R(i)*Q(i)*R(i)*R(i)^(-1) + sigma(i)*I 
                = R(i)*(A(i)-sigma(i)*I)*R(i)^(-1) + sigma(i)*I 
                = R(i)*A(i)*R(i)^(-1) 

  If sigma(i) is an exact eigenvalue of A, we claim QR Iteration 
  converges in one step: If A(i) - sigma(i)*I is singular, then
  R(i) is singular, so some diagonal entry of R(i) must be zero.
  Suppose that the last diagonal entry R(i)_nn = 0. Then the
  whole last row of R(i) is zero, so the last row of R(i)*Q(i) is zero,
  so the last row of A(i+1) = R(i)*Q(i) + sigma(i)*I is zero except 
  for A(i+1)_nn = sigma(i) as desired. This reduces the problem to
  one of dimension n-1, namely the first n-1 rows and columns of A(i+1).
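   Here is a Matlab sketch of this one-step deflation (our own demo; we reduce to Hessenberg
   form first, and choose n odd so a real eigenvalue is guaranteed to exist):
      n = 5; H = hess(randn(n));                  % generically unreduced upper Hessenberg
      e = eig(H); s = e(imag(e) == 0); s = s(1);  % an (essentially) exact real eigenvalue
      [Q, R] = qr(H - s*eye(n));
      H1 = R*Q + s*eye(n);
      H1(n, 1:n-1)                                % essentially zero: deflated in one step
      H1(n, n) - s                                % also essentially zero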

   If sigma(i) is not an exact eigenvalue, we declare convergence when
  A(i+1)_n,1:n-1 is small enough. From earlier analysis we expect
  this block to shrink in norm by a factor 
     | lambda(k) - sigma(i) | / min_{j neq k} | lambda(j) - sigma(i) |
  where lambda(k) is the eigenvalue closest to sigma(i).

  Here is how to see that we are implicitly doing inverse iteration.
  First, since  A(i) - sigma(i)*I = Q(i)*R(i), we get 
        Q(i)^* * (A(i) - sigma(i)*I) = R(i)
  so if sigma(i) is an exact eigenvalue, the last row of Q(i)^* times A(i) - sigma(i)*I is zero,
  and so the last column of Q(i) is a left eigenvector of A(i) for eigenvalue sigma(i).
  Now suppose that sigma(i) is just close to an eigenvalue.
  Then  A(i) - sigma(i)*I = Q(i)*R(i),  so
        (A(i) - sigma(i)*I)^(-1) = R(i)^(-1)*Q(i)^T
        (A(i) - sigma(i)*I)^(-1)^* = Q(i)*R(i)^(-1)^*
        (A(i) - sigma(i)*I)^(-1)^* * R(i)^* = Q(i)
  and taking the last column of both sides we get that
        (A(i) - sigma(i)*I)^(-1)^* * e_n and the last column of Q(i)
  are parallel, i.e. the last column of Q(i) is gotten by a step of inverse iteration,
  starting with e_n (the last column of I).

  So where do we get sigma(i) so that it is a good approximate eigenvalue?
  Since we expect A(i)_nn to converge to an eigenvalue, we simply pick sigma(i) = A(i)_nn.
  We show that this in fact yields quadratic convergence, i.e. the error is squared
  at each step, so the number of correct digits doubles. To see why, suppose
   || A(i)_n,1:n-1 || = eps << 1, so that  | A(i)_nn - lambda(k) | = O(eps) for some
  eigenvalue lambda(k), and that the other eigenvalues are much farther away than eps.
  Then by the above analysis, || A(i)_n,1:n-1 || will get multiplied by
     | lambda(k) - sigma(i) | / min_{j neq k} | lambda(j) - sigma(i) | = O(eps),
  and so on the next iteration || A(i+1)_n,1:n-1 || = O(eps^2)

   In Matlab (demo), try 
     format short e,
     n=6, D = diag(.5.^[1:n]), S = randn(n,n), A = S*D*inv(S), Z = eye(n);
   and repeat
     [Q,R] = qr(A); A = R*Q   ... should be same as Orthogonal Iteration before
   and then
     s = A(n,n); [Q,R] = qr(A-s*eye(n)); A = R*Q+ s*eye(n),   ... does it converge quadratically?
     s = A(n,n); [Q,R] = qr(A-s*eye(n)); A = R*Q+ s*eye(n); (s-A(n,n))/s  ... does it converge quadratically?

  To see more explicitly why we get quadratic convergence, at least asymptotically, consider the
  following example in more detail:
    A = randn(6,6); A(6,:)=1e-4*A(6,:); A(6,6)=3;
      ... so the bottom row A(6,1:5) is small, so we are close to convergence
    s=A(6,6); [Q,R]=qr(A-s*eye(6))
      ... note that the last row Q(6,1:5) is of the same magnitude as A(6,1:5)
      ... because of the way QR works, and R(6,6) is similarly small
    R*Q 
      ... note that R(6,6) multiplies Q(6,1:5), squaring their (small) magnitudes,
      ... this is the source of quadratic convergence in the next line
    A = R*Q + s*eye(6)

  If A = A^T, convergence is even faster, cubic, for reasons explained in Chapter 5.

  Making QR iteration practical:
   (1) Each iteration has the cost of QR factorization plus matmul, or O(n^3). Even with
       just a constant number of iterations per eigenvalue, the cost is O(n^4). We want
       a total cost of O(n^3)
   (2) How do we pick a shift to converge to a complex eigenvalue, when the matrix is real,
       and we want to use real arithmetic? In other words, how do we compute the real Schur form?
   (3) How do we decide when we have converged?
   (4) How do we minimize data movement, the way we did for matmul, etc?

  Here are the answers, briefly:
    (1) We preprocess the matrix by factoring it as A = Q*H*Q^T, where Q is orthogonal and
        H is upper Hessenberg, i.e. nonzero only on and above the first subdiagonal.
       It turns out that QR iteration on H leaves it upper Hessenberg, and
       lets us reduce the cost of one QR iteration from O(n^3) to O(n^2),
       and so the cost of n QR iterations to O(n^3) as desired.
       When A is symmetric, so that H is upper Hessenberg and symmetric, it must be tridiagonal;
       this further reduces the cost of one QR iteration to O(n), and n iterations to O(n^2).
       We discuss this further in Chap 5.
   (2) Since complex eigenvalues of real matrices come in complex conjugate pairs, we
       can imagine taking one QR iteration with a shift sigma followed by one QR iteration
        with shift conj(sigma). It turns out this brings us back to a real matrix, and by
       reorganizing the computation, merging the two QR iterations, we can avoid all complex
       arithmetic.
   (3) When any subdiagonal entry H(i+1,i) is small enough, basically |H(i+1,i)| = O(macheps)*||H||,
       then we set it to zero, since this causes a change no larger than what roundoff does
       anyway. This splits the matrix into two parts, i.e. it is block upper Hessenberg,
       and we can deal with each diagonal block separately. If a block is 2x2 with complex
       eigenvalues, or 1x1, we are done.
    (4) No way is known to reduce the number of words moved to Omega(#flops/sqrt(memory_size)),
        as we could for matmul, LU, and QR, using this algorithm. There is a lot of recent
        research on reducing memory traffic by combining many QR iterations and interleaving
        their operations in such a way as to get the same answer but move less data (SIAM Linear
        Algebra Prize 2003, to Byers/Mathias/Braman).
       There are other algorithms that do move as little data as matmul,
       but they do a lot more arithmetic (and will someday be of more practical importance).

  Here are some more details:
   (1) Hessenberg reduction is analogous to QR decomposition: keep multiplying the matrix
       by Householder transformations P to create zeros. But now you have to multiply on the left
       and right:  P*A*P  (since P=P^T) to maintain orthogonal similarity:
         (draw 5 x 5 example)
       The code is analogous:
          for i = 1:n-2  ... zero out matrix entries A(i+2:n,i)
            u = House(A(i+1:n,i))   ... unit Householder vector, as in QR decomposition
            A(i+1:n,i:n) = A(i+1:n,i:n) - 2*u*(u^T*A(i+1:n,i:n)) ... multiply A = P*A
            A(1:n,i+1:n) = A(1:n,i+1:n) - 2*(A(1:n,i+1:n)*u)*u^T ... multiply A = A*P
          end
       The cost is (10/3)n^3 + O(n^2) just for A, or (14/3)n^3 + O(n^2) if we multiply
       out the Householder transformations to get Q. This is a lot more than LU or QR,
       and is only the first, cheap phase of the algorithm.
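        In Matlab, the built-in hess computes this reduction; a quick check (our own demo):
           A = randn(6);
           [Q, H] = hess(A);             % A = Q*H*Q', Q orthogonal, H upper Hessenberg
           norm(Q*H*Q' - A)/norm(A)      % O(macheps)
           norm(tril(H, -2))             % zero below the first subdiagonal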

       When A = A^T, then the resulting Hessenberg matrix H = Q*A*Q^T is also symmetric,
       and so is in fact a tridiagonal T. This is called tridiagonal reduction, and 
        is the starting point for solving the symmetric eigenvalue problem (Chap 5).

       For the SVD, we do something similar, but with different orthogonal matrices on
       the left and right to make A bidiagonal: QL*A*QR^T = B, i.e. nonzero only on the
       main diagonal of B and right above it.
         (draw 5 x 5 example)
       This will be the starting point for computing the SVD in chapter 5; once we
       have the SVD of B = U*Sigma*V^T we get the SVD of A as (QL^T*U)*Sigma*(QR^T*V)^T

    Lemma: Hessenberg form is maintained by QR iteration
     Proof: If A is upper Hessenberg, so is A - sigma*I, and if A - sigma*I = Q*R,
     it is easy to confirm that Q is also upper Hessenberg (column i of Q is a linear combination
     of columns 1:i of A - sigma*I, whose entries below row i+1 are all zero). Then it is also
     easy to see that R*Q is upper Hessenberg.
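     A quick Matlab check (our own demo) that Q and R*Q stay upper Hessenberg:
        H = hess(randn(6)); s = H(6,6);
        [Q, R] = qr(H - s*eye(6));
        norm(tril(Q, -2))                        % essentially zero: Q is upper Hessenberg
        norm(tril(R*Q + s*eye(6), -2))           % so is the next iterate R*Q + s*I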

     Finally we explain how to do one step of shifted QR iteration on an upper Hessenberg
     matrix H in just O(n^2) flops, not O(n^3).
     
    Def: an upper Hessenberg matrix H is unreduced if all the subdiagonals H(i+1,i) are nonzero.
    Note that if H has some H(i+1,i)=0, it is block triangular, and we can solve the eigenproblems
    for H(1:i,1:i) and H(i+1:n,i+1:n) independently.

    Implicit Q Thm: Suppose that Q^T*A*Q is upper Hessenberg and unreduced. Then columns 2 through
    n of Q are determined uniquely (up to multiplying by +-1) by column 1 of Q.

     First we see how to use this to do one step of QR iteration in O(n^2) flops, and then prove it.
     Recall that Q comes from factoring A - sigma*I = Q*R, so the first column of Q is proportional
     to the first column of A - sigma*I, namely [A(1,1) - sigma; A(2,1); 0; ... ].
     Let Q(1) be a Givens rotation (acting on rows 1 and 2) whose first column is proportional to
     this vector, and form Q(1)^T*A*Q(1); this fills in A(3,1), a so-called "bulge", making the
     matrix no longer Hessenberg.
     Our algorithm will remove the bulge, multiplying by more Givens rotations Q(i), making
        Q(n-1)^T* ... *Q(2)^T * Q(1)^T * A * Q(1) * Q(2) * ... *Q(n-1)
     upper Hessenberg again, at a cost of O(n^2). Since the Hessenberg form is uniquely determined
     by Q(1) (by the Implicit Q Theorem), this must be the answer.
    (draw picture)
    Since each Q(i) moves the bulge down by one, until it "falls off", this algorithm
    is called "chasing the bulge".

    Proof of the Implicit Q Theorem: let q_i be column i of Q. Then Q^T*A*Q=H implies
     A*Q = Q*H. Look at column 1 of both sides: A*q_1 = H(1,1)*q_1 + H(2,1)*q_2.
    This determines H(1,1), H(2,1) and q_2 uniquely, by doing QR on 
        [q_1,A*q_1] = [q_1,q_2]*[ 1 H(1,1) ]
                                [ 0 H(2,1) ]
    More generally, suppose we have determined q_2 , ... , q_i and columns 1:i-1 of H.
    We get the next column by equating the i-th columns of A*Q=Q*H to get
        A*q_i = sum_{j=1 to i+1} q_j*H(j,i)
    So q_j^T*A*q_i = H(j,i) for j=1 to i, and then 
        A*q_i - sum_{j=1 to i} q_j*H(j,i) = q_(i+1) * H(i+1,i)
     gives us q_(i+1) and H(i+1,i).

     This is the basis of the LAPACK codes xGEES (for Schur form) and xGEEV (for eigenvectors),
     which are used by eig() in Matlab.