Notes for Ma221 Lecture 9, Sept 23 2010

Goals for today:
 Gaussian elimination for matrices with special structure, which lets you save time and
 space if you recognize it.
   Symmetric positive definite matrices (Cholesky)
      save half time, space vs GEPP
   Symmetric matrices
      save half time, space vs GEPP
   Band matrices
      cost goes from O(n^3) to O(bw^2*n), space from n^2 to O(bw*n)
   Sparse matrices
      cost, space can drop a lot, depends a lot on matrix
      complicated algorithms, no one best library

   "Structured matrices"
       dense, but depend on O(n) parameters, e.g. Toeplitz
      there are O(n^2) or faster algorithms for many of these
   Can imagine as many special structures as individual problems, above
      are most common

 SYMMETRIC POSITIVE DEFINITE 

 Recall Def: A spd iff A=A^T and x^T*A*x > 0 for all x neq 0

 Lemma: 1: X nonsingular => (A spd iff X^T*A*X spd)
            Pf: A spd and x neq 0 => X*x neq 0 => x^T*(X^T*A*X)*x = (X*x)^T*A*(X*x) > 0
                X^T*A*X spd and x neq 0 => inv(X)*x neq 0 => x^T*A*x = (inv(X)*x)^T*(X^T*A*X)*(inv(X)*x) > 0
         2: A spd and H=A(j:k,j:k) => H spd (H is called a "principal submatrix")
            Pf: A spd and y neq 0 => x = [0;y;0] neq 0, where y occupies rows j:k => y^T*H*y = x^T*A*x > 0
         3: A spd iff A=A^T and all eigenvalues lambda_i of A are > 0
            Pf: A=A^T => A*Q = Q*Lambda, Q orthogonal matrix of eigenvectors and
                Lambda diagonal matrix of eigenvalues => A = Q*Lambda*Q^T =>
                x^T*A*x = (Q^T*x)^T*Lambda*(Q^T*x) = sum_i y(i)^2 * lambda(i), where y = Q^T*x.
                If A spd, choosing x = Q*e(i) (so y = e(i)) gives lambda(i) = x^T*A*x > 0.
                Conversely if all lambda(i) > 0 and x neq 0, then y = Q^T*x neq 0, so
                x^T*A*x = sum_i y(i)^2 * lambda(i) > 0, i.e. A spd.
        4: A spd => a(i,i)>0 and max_ij |a(i,j)| = max_i |a(i,i)|
           Pf: e(i)^T*A*e(i) = A(i,i) > 0. Suppose A(i,j) with i neq j were the largest
               entry in absolute value. Let x be a vector of zeros except for x(i)=1 and x(j)=-sign(A(i,j))
               Then x^T*A*x = A(i,i) + A(j,j) - 2*abs(A(i,j)) <= 0, a contradiction.
        5: A spd iff A=L*L^T where L lower triangular with positive diagonal entries
           Pf: We use induction on n, showing Schur complement spd:
               Let A = [ A11  A12 ] = [ sqrt(A11)       0 ]  * [  sqrt(A11)  A12/sqrt(A11) ]
                       [ A12' A22 ]   [ A12'/sqrt(A11)  I ]    [     0            S        ]
                            where A22 = S + A12'*A12/A11  so S = A22 - A12'*A12/A11 is symmetric (the Schur complement)
                     = [ sqrt(A11)      0 ] * [ 1 0 ] * [ sqrt(A11)  A12/sqrt(A11) ]
                       [ A12'/sqrt(A11) I ]   [ 0 S ]   [   0           I          ]
                     =  X^T * [ I 0 ] * X
                              [ 0 S ]
                So [ I 0 ] = inv(X)^T*A*inv(X) is spd by part 1 (X is nonsingular since A11 > 0 by part 4),
                   [ 0 S ]
                   and so S is spd by part 2.
               By induction S = Ls * Ls^T where Ls lower triangular with positive diagonal:
                   A = [ sqrt(A11)      0 ] * [ 1   0     ] * [ sqrt(A11)  A12/sqrt(A11) ]
                       [ A12'/sqrt(A11) I ]   [ 0 Ls*Ls^T ]   [   0           I          ]
                      = [ sqrt(A11)      0  ] * [ sqrt(A11)  A12/sqrt(A11) ]
                        [ A12'/sqrt(A11) Ls ]   [    0         Ls^T        ]
                     = L*L^T
              
    Def: If A spd, the factorization A = L*L^T with L lower triangular with positive diagonal is called
         the Cholesky factorization
    Lemma: The Cholesky factorization is unique
        Proof: Suppose L1*L1^T = L2*L2^T with both factors lower triangular with positive diagonals.
               Then inv(L2)*L1 = L2^T*inv(L1^T), i.e. a lower triangular matrix equals an upper
               triangular one, so both equal some diagonal matrix D. From inv(L2)*L1 = D we get
               L1 = L2*D, and from L2^T*inv(L1^T) = D we get L2^T = D*L1^T, i.e. L2 = L1*D.
               Combining, L1 = L1*D^2, so D^2 = I and each D(i,i) = +-1. Since the diagonals of
               L1 and L2 are positive and L1 = L2*D, each D(i,i) > 0, so D = I and L1 = L2.

   Relationship of Cholesky to LU decomposition without pivoting:
    Lemma: If A spd and A=L*U with L unit lower triangular, let D be diagonal with D(i,i) = sqrt(U(i,i)) > 0.
          Then A = (L*D)*(inv(D)*U) = Ls*Ls^T is the Cholesky factorization
      Proof: Homework!

    This lemma suggests that any algorithm for LU can be modified to do Cholesky,
    and in fact to use half the work and space of GEPP, since L and U are simply related.
   Here is the simplest version:
   Cholesky algorithm to compute A = L*L^T just follows induction proof:
     for j = 1:n
        L(j,j) = sqrt( A(j,j) - sum( L(j,1:j-1).^2 ) );
        L(j+1:n,j) = ( A(j+1:n,j) - L(j+1:n,1:j-1)*L(j,1:j-1)' ) / L(j,j);
     end

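    A quick check of this loop in MATLAB (a minimal sketch; the random spd test matrix is my own choice, not from the notes):
      n = 6; B = randn(n); A = B*B' + n*eye(n);   % a random spd test matrix
      L = zeros(n);
      for j = 1:n
         L(j,j) = sqrt( A(j,j) - sum( L(j,1:j-1).^2 ) );
         L(j+1:n,j) = ( A(j+1:n,j) - L(j+1:n,1:j-1)*L(j,1:j-1)' ) / L(j,j);
      end
      norm(L*L' - A)      % ~ macheps*norm(A)
      norm(L - chol(A)')  % chol returns the upper triangular factor R = L^T
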
   But all the other ideas (blocking, recursive, parallel, communication lower bounds etc) apply

    Error analysis: The same approach as for LU yields the analogous bound:
          (A+E)(x+dx) = b where |E| <= n*macheps*|L|*|L^T|

    Since we do not pivot, we need a different argument to bound |L|*|L^T|:
    (|L|*|L|^T)_{ij} = sum_k |L_{ik}|*|L_{jk}|
                    <= norm(L(i,:))*norm(L(j,:))     ... Cauchy-Schwarz inequality
                     = sqrt(A_{ii})*sqrt(A_{jj})     ... since A(i,i) = sum_k L(i,k)^2
                    <= max_i A_{ii} = max_ij |A_{ij}|   ... by part 4 of the Lemma
    so Cholesky is backward stable with no pivoting at all.
     
 SYMMETRIC INDEFINITE

  Also possible to save half space, flops
  A = L*D*L^T where D has 1-by-1 and 2-by-2 diagonal blocks
   More complicated pivoting is needed: consider A = [0 1; 1 0], where both diagonal entries are zero so no 1-by-1 diagonal pivot works
  Note: not clear how many of the ideas used to make GEPP go fast
   can be applied to LDL^T factorization, because of the more
   complicated pivoting (need to search down a column, and
   across a row). It is possible to have an explicitly
   blocked code (see LAPACK routine ssytrf). 
*  What about recursive (cache-oblivious) code? What about explicit parallel code?
*  Not done, to the best of my knowledge
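   MATLAB exposes a symmetric-indefinite factorization of this form through ldl; a small
   sketch on the 2-by-2 example above (the exact L, D, P returned may depend on the version):
      A = [0 1; 1 0];            % indefinite; needs a 2-by-2 pivot
      [L,D,P] = ldl(A);          % P'*A*P = L*D*L', D block diagonal with 1x1 and 2x2 blocks
      L, D, P                    % here D should come out as a single 2-by-2 block
      norm(P'*A*P - L*D*L')      % ~ macheps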

 BAND MATRICES

   Def: lbw = lower bandwidth (A(i,j)=0 if i-j > lbw), ubw = upper bandwidth (A(i,j)=0 if j-i > ubw)
  Case without pivoting (eg Cholesky): 
     ubw(U)=ubw(A), lbw(L)=lbw(A)
     (picture)
     Cost = 2*n*ubw*lbw + n*lbw  = O(n) for narrow bandwidths
  Case with pivoting:
     ubw(U)=ubw(A)+lbw(A)
    "lbw(L)" = lbw(A) (doesn't look banded, but uses same amount of storage)
     picture, Matlab spyplots
     Ex: n=20,lbw=4,ubw=3; A = tril(triu(randn(n,n),-lbw),ubw);
         Ad = A + 100*eye(n); % no pivoting required
         [Ld,Ud,Pd]=lu(Ad);
         figure(1), 
         subplot(221),spy(Ad),title('Ad'),subplot(222),spy(Ld),title('Ld'),
         subplot(223),spy(Ud),title('Ud'),subplot(224),spy(Pd),title('Pd')
         figure(2), 
         [L,U,P]=lu(A);
         subplot(221),spy(A),title('A'),subplot(222),spy(L),title('L'),
         subplot(223),spy(U),title('U'),subplot(224),spy(P),title('P')

*  possible class project: recognize bandedness on the fly in LAPACK
*  drop cost of LU from O(n^3) to O(n) flops + O(n^2) comparisons for 
*  narrow band matrix (in dense format), without changing cost for dense matrix

  Band matrices often arise from discretizing differential equations:
  Sturm-Liouville problem for -y''(x) + q(x)*y(x) = r(x)
     on an interval [a,b], y(a)=alpha, y(b)=beta, q(x) >= qbar > 0, discretize
      at x(i) = a+i*h, h=(b-a)/(N+1), with unknowns y(1) = y(x(1)),...,y(N) = y(x(N)).
      Approximate y''(x(i)) ~ (y(i+1) - 2*y(i) + y(i-1))/h^2.
      Letting q(i) = q(x(i)) and r(i) = r(x(i)) we get the linear system
             -(y(i+1) - 2*y(i) + y(i-1))/h^2  + q(i)*y(i) = r(i)
     or A*y = v where 
        v = r + [alpha/h^2 0 ... 0 beta/h^2]'
        A = diag(2/h^2 + q(i))   ... diagonal
            + diag(-1/h^2 , 1 )
            + diag(-1/h^2 , -1 )  ... so A symmetric
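      A minimal MATLAB sketch of building and solving this system (the particular q, r, N and
      boundary values are my own choices, just for illustration):
         a = 0; b = 1; alpha = 1; beta = 2; N = 100; h = (b-a)/(N+1);
         x = a + (1:N)'*h;
         q = 1 + x.^2;                   % any q(x) >= qbar > 0 will do
         r = ones(N,1);                  % right-hand side r(x)
         e = ones(N,1);
         A = spdiags([-e/h^2, 2*e/h^2 + q, -e/h^2], -1:1, N, N);  % symmetric tridiagonal
         v = r; v(1) = v(1) + alpha/h^2; v(N) = v(N) + beta/h^2;
         y = A \ v;                      % solved by a banded/sparse solver
         min(eig(full(A)))               % positive, as the Gershgorin argument below shows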

   To prove positive def: 
      Gershgorin's Theorem:  All eigenvalues lambda of A lie in the union of n circles in the complex plane,
         where circle i has center A(i,i) and radius sum_{j=1 to n except i} |A(i,j)|
       proof: if A*x = lambda*x, let |x(i)| be the largest entry in absolute value of x (so x(i) neq 0).
         Row i of A*x = lambda*x says sum_j A(i,j)*x(j) = lambda*x(i),
         or   (lambda - A(i,i))*x(i) = sum_{j=1 to n except i} A(i,j)*x(j)
         or   |lambda - A(i,i)| <= sum_{j=1 to n except i} |A(i,j)|*|x(j)/x(i)|
                                <= sum_{j=1 to n except i} |A(i,j)|  as desired

    Apply Gershgorin to our A = A^T: every eigenvalue lies in a circle with center 2/h^2 + q(i) >= 2/h^2 + qbar
       and radius at most 2/h^2; since A is symmetric the eigenvalues are real, so all are >= qbar > 0, i.e. A is spd.

   Now consider Poisson's equation:
      d^2 u(x,y)/dx^2 + d^2 u(x,y)/dy^2 + q(x,y)*u(x,y) = r(x,y)
   in square:  0 <= x <= 1 and 0 <= y <= 1 with boundary conditions: u(x,y) given on
   boundary of square.
   Typical example: Given temperature on boundary of a square of uniform material,
    what is the temperature everywhere in the middle? Here q=r=0.
   We discretize the square with a nxn mesh:
    h=1/(n+1)
    x(i) = i*h for i=1:n, so x(0)=0, x(n+1) = 1
    y(j) = j*h for j=1:n, so y(0) = 0, y(n+1) = 1
    u(i,j) = u(x(i),y(j)),  same for q(i,j) and r(i,j)
    Again approximate
      d^2 u(x,y)/dx^2 ~ [ u(x-h,y) - 2*u(x,y) + u(x+h,y) ] / h^2
      d^2 u(x,y)/dy^2 ~ [ u(x,y-h) - 2*u(x,y) + u(x,y+h) ] / h^2
    Add these, evaluate at (x(i),y(j)), so
      d^2 u(i,j)/dx^2 + d^2 u(i,j)/dy^2 ~ [ u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4*u(i,j)]/h^2
    (Illustrate for n=4, 6 using A=TwoDL(n))
    Using "natural order" (rowwise or columnwise ordering of mesh points (i,j)), we
    get an spd band matrix of dimension n^2 and bandwidth n
    (later we will have explicit expressions for all eigenvalues and eigenvectors of this matrix)
    (preview: can build eigenvector matrix from FFT)
    Note: banded, but at most 5 nonzeros per row, so should really think of it as sparse...
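     TwoDL is a course routine; a minimal sketch of one standard way to build the same n^2-by-n^2
     five-point matrix (here with q=0, scaled by h^2, and with the sign chosen to make it positive
     definite; TwoDL may differ in sign and scaling, hence chol(-A) in the sparse example below):
        n = 6;
        T = spdiags(ones(n,1)*[-1 2 -1], -1:1, n, n);  % 1D second-difference matrix
        I = speye(n);
        A = kron(I,T) + kron(T,I);                     % 2D 5-point matrix: spd, bandwidth n, <=5 nonzeros/row
        figure, spy(A), title('2D Poisson matrix, natural ordering')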

 SPARSE MATRICES

   small matlab example to show the importance of ordering, on an "arrowhead matrix"
    A = eye(8);A(1,2:8)=.1;A(2:8,1)=.1;
    [L,U]=lu(A);A,L,U, 
    figure(1), clf, spy(L,'b'), hold on, spy(U,'r'), spy(A,'k')
   Result is dense! So performing LU and storing L and U would
   cost as many flops and words as for a dense matrix, O(n^3) and O(n^2) respectively.
  Now let's try LU on P*A*P^T, where P is a permutation
  that reverses the order of rows and columns
    Arev=A(8:-1:1,8:-1:1);
    [Lrev,Urev]=lu(Arev); Arev, Lrev, Urev
    figure(1), clf, spy(Lrev,'b'), hold on, spy(Urev,'r'), spy(Arev,'k')
  Result is very sparse! And if you rewrite LU to take advantage of this,
  to neither store nor compute entries that are certain to be zero,
  the cost in flops and words would drop to O(n), much smaller.
  Matlab has many built-in sparse matrix operations (LU, Cholesky, etc),
  that take advantage of sparsity like this; do "help sparse" to learn more.
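   For example, MATLAB's sparse LU chooses a sparsity-preserving column ordering on its own
   (a small sketch, reusing the arrowhead A, L, U from above):
     As = sparse(A);                  % same arrowhead matrix, stored sparsely
     [Ls,Us,Ps,Qs] = lu(As);          % Ps*As*Qs = Ls*Us, Qs chosen to limit fill-in
     nnz(Ls)+nnz(Us), nnz(L)+nnz(U)   % typically far fewer nonzeros than the unpermuted factors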

  Let's see what happens to the 2D Poisson equation:
    A = TwoDL(6); R = chol(-A); ... R = upper triangular Cholesky factor 
   figure(1), clf, spy(R,'b'),hold on, spy(A,'k')
  So the band fills in entirely, costing O(N*bw^2) = N*sqrt(N)^2 = N^2 = n^4
  for an nxn mesh with N = n^2 unknowns. We'll see that we can do much better,
  O(n^3) instead, for a clever choice of permutation P in P*A*P'.

  EX: sparse Cholesky on finite element structure (Figures 2.9-2.11 in text)

  MechStructureMesh1.eps - basic mesh, 483 nodes, one unknown for
     each node, a(i,j) nonzero if edge connects nodes i and j

  Def: A weighted, undirected graph G is a collection V,E,W of sets
       V = vertices (aka nodes), 
        E = edges connecting pairs of vertices (u,v) 
            (note: (u,u) is allowed; undirected means (u,v) is the same as (v,u))
       W = weight (number) associated with each edge

  There is an obvious one-to-one relationship between a weighted undirected 
  graph G and a symmetric matrix, and we will use the notations interchangeably:
     V = one row and column per vertex, 
     E = locations of nonzeros (note: (u,u) on diagonal)
     W = values of nonzeros
     
  MechStructureMesh2.eps - to make a matrix, need to number nodes in some way 

  MechStructureANatural.eps - color coded matrix to match structure
     => 3971 nonzeros => 3971/483^2 = 1.7% filled

  MechStructureNatural.eps - original matrix and Cholesky factor
     nnz = 11533, up from 3971/2, 3e5 flops (vs (1/3)n^3 = 3.7e7)
     note "skyline" structure of matrix, which some sparse Cholesky routines exploit
     red entries are new nonzeros created by factorization, called "Fill-in"

  MechStructureRCM.eps - original matrix with reverse Cuthill-McKee ordering
     tends to make bandwidth narrow, nnz = 9073 (better), 1.8e5 flops (better)

  MechStructureMinDeg.eps - another popular order
     nnz = 8440, 2e5 flops

  MechStructureRand.eps - a bad ordering:
     nnz = 37906, 5.6e6 flops

   Discretization of the air around a wing tells a similar story:
   run "showdemo airfoil" in Matlab (figure 2.12 in text)

  Challenges to implementation
     Storing and operating on only nonzeros in an efficient way:
        Simplest imaginable data structure: 
          COO = coordinate format: list of nonzero entries and their locations 
                (a(i,j),i,j) in some order (example)
                   A = [ 2 0 7 0 5 ] , COO = ((2,1,1),(7,1,3),(5,1,5),(1,2,2),(4,2,3),(3,2,5),(8,3,3))
                       [ 0 1 4 0 3 ]
                       [ 0 0 8 0 0 ]
        We can do better:
          CSR = Compressed Sparse Row:
             val = array of nonzero entries in each row, from row 1 to row n,
                   from left to right
              col_index = column index of each nonzero  (val(i) lies in column col_index(i) of the sparse matrix)
              rowbegin = pointer to start of each row: the entries of row i lie in 
                         positions rowbegin(i) through rowbegin(i+1)-1 of val() and col_index()
           may use about 2/3 as much memory as COO (example)
               For the same A as above:
                    val = (2 7 5 1 4 3 8), col_index = (1 3 5 2 3 5 3), rowbegin = (1 4 7 8)
          many other formats: defer details until discussion of iterative methods
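           A minimal sketch of computing y = A*x directly from the CSR arrays of the example
           above (in a real library this loop lives inside compiled code):
              val       = [2 7 5 1 4 3 8];
              col_index = [1 3 5 2 3 5 3];
              rowbegin  = [1 4 7 8];
              n = 3; x = (1:5)'; y = zeros(n,1);
              for i = 1:n
                 for k = rowbegin(i):rowbegin(i+1)-1
                    y(i) = y(i) + val(k)*x(col_index(k));
                 end
              end
              y   % equals [2 0 7 0 5; 0 1 4 0 3; 0 0 8 0 0]*(1:5)'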

     Picking the "best order" of rows and columns
        Want to minimize time, memory, and get the right answer (stability)
         Order affects fill-in, which affects time and memory (less is better!)
            so want to pick order to minimize fill-in
         Order affects backward stability for LU (and symmetric indefinite LDL^T), but not Cholesky
           Stability proof for Cholesky did not depend on order
           Cholesky is easiest case: can pick order just to minimize fill in

   What is the best order for Cholesky?
      Thm: Picking the ordering that minimizes fill-in (out of all n! possibilities) is NP-hard, so
           no polynomial-time algorithm for it is known, and one is unlikely to exist
     So need to use heuristics:

        RCM = Reverse Cuthill-McKee = Breadth-First Search order, reversed
          Consider matrix as Graph
           Def: A path in a graph from vertex v1 to vk is a sequence of edges
                (v1,v2), (v2,v3), ... , (v_{k-1},vk) connected end-to-end
          Def: The distance between any two vertices is the length of the 
               shortest path (fewest edges) connecting them
          RCM: (1) pick any vertex, call it the root
               (2) compute distance of all other vertices from root, and label them
                   by distance, (so distance=0 is the root, distance=1 are the vertices
                   with an edge to the root, etc)
                   Cost = O(#nonzeros in matrix), using Breadth-First-Search, cheap!
               (3) Order vertices in reverse order by distance (farthest first)
           Fact: Vertices at distance k can only be directly connected to vertices
                 at distance k-1, k, or k+1 (otherwise some distance would be wrong!)
          Fact explains why RCM ordering tends to make matrix entries close to diagonal:
          matrix becomes block tridiagonal, with vertices at distance k making up
          diagonal block k.  So it is like a band matrix, and limits fill-in to the band.
          Note: called symrcm in Matlab

         Minimum Degree (MD): 
          Def: the degree of a vertex in the graph is the number of edges 
               for which it is an endpoint
           Fact: if the vertex corresponding to row/column i is used as a pivot, and has degree d,
                 then eliminating it takes about d^2 muls and adds, and so can create at most d^2
                 fill-in entries (the updated entries that were originally zero; ignoring symmetry)
          MD: at each step pick pivot to have minimum degree 
              (note: as factorization proceeds, need to update matrix and its graph,
               so degree can change; use latest degree)
           Question: How does the graph change after one elimination step?
               (Answer: delete the pivot vertex and its edges, and connect all of its former
                neighbors to each other; those new edges are exactly the fill-in it creates.)
          Variants called amd, symamd, colamd in matlab
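           A quick way to see the effect of these orderings in MATLAB (a minimal sketch on the
           model Poisson matrix; the nonzero counts will differ from the mesh figures above):
              A = delsq(numgrid('S',32));    % 2D Poisson matrix, 900 unknowns, natural order
              p = symrcm(A); q = symamd(A);
              [nnz(chol(A)) nnz(chol(A(p,p))) nnz(chol(A(q,q)))]  % fill: natural vs RCM vs min degree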

     Nested Dissection
          General idea:
             "Bisect" vertices of graph into 3 sets V = V1 u V2 u Vs (s for "separator") 
             such that 
                (1) about the same number of vertices in V1 and V2
                (2) Vs much smaller than V1 and V2
                (3) there are no edges directly connecting vertices in V1 and V2
              Many algorithms for this (see Lecture 13 on graph partitioning from CS267, Spring 09)
             Now reorder vertices with V1 first, then V2, then Vs: What does matrix look like?
              A =  [  A1    0   A1s ] so after Cholesky L = [ L1    0     0  ]
                   [   0   A2   A2s ]                       [ 0     L2    0  ]
                   [  A1s' A2s' As  ]                       [ L1s   L2s   Ls ]
               in particular, A1 = L1*L1^T and A2 = L2*L2^T are independent 
               Cholesky factorizations, and the (2,1) block of L stays zero.
             So how do we do Cholesky on A1 and A2? Use same idea recursively. 
             (see CS267, Spring 09, lecture 13, slide #9)

          Illustrate on matrix from 2D mesh 
          (draw ordering on mesh) 
          (MatLab: m=3 (n=2^m-1), NestedDissection)
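           Since NestedDissection is a course routine, here is a minimal sketch of one simple
           variant (my own naming and conventions: split the longer side of the current submesh
           with a line of mesh points, order the two halves first and the separator last, recursively):
              function p = nd_order(n, rows, cols)
              % Sketch of a nested dissection ordering of the n^2 unknowns of an n-by-n mesh,
              % numbered column-major as in sub2ind; call as p = nd_order(n, 1:n, 1:n).
                if isempty(rows) || isempty(cols), p = []; return, end
                if numel(rows)*numel(cols) <= 3           % small submesh: order it directly
                  [C,R] = meshgrid(cols, rows);
                  p = sub2ind([n n], R(:)', C(:)');
                  return
                end
                if numel(rows) >= numel(cols)             % separator = middle row of the submesh
                  m  = ceil(numel(rows)/2);
                  p1 = nd_order(n, rows(1:m-1), cols);                      % V1
                  p2 = nd_order(n, rows(m+1:end), cols);                    % V2
                  ps = sub2ind([n n], rows(m)*ones(1,numel(cols)), cols);   % Vs
                else                                      % separator = middle column of the submesh
                  m  = ceil(numel(cols)/2);
                  p1 = nd_order(n, rows, cols(1:m-1));
                  p2 = nd_order(n, rows, cols(m+1:end));
                  ps = sub2ind([n n], rows, cols(m)*ones(1,numel(rows)));
                end
                p = [p1, p2, ps];                         % separator numbered last
              end
           E.g. with n = 7, A = delsq(numgrid('S',n+2)), p = nd_order(n,1:n,1:n), compare
           nnz(chol(A)) with nnz(chol(A(p,p))) and spy the two factors.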

          Thm (George, Hoffman/Martin/Rose, Gilbert/Tarjan, 1970s & 1980s): 
            Any ordering for Cholesky on a 2D n x n mesh
            (so the matrix is n^2 x n^2) has to do at least Omega(n^3) flops, and this is attained by
            nested dissection ordering. (If it were dense, it would cost O(n^6), a lot more).
             This extends to "planar graphs", i.e. graphs you can draw on
             paper without crossing any edges.

          Thm (Ballard, D., Holtz, Schwartz, 2009): Lower bound on #words_moved for sparse LU, Cholesky is Omega(#flops/sqrt(M))
            where #flops is number actually performed. So lower bound on #words_moved for Cholesky on
            nxn mesh is Omega(n^3/sqrt(M))

         Thm (David, D., Grigori, Peyronnet 2010). Possible to achieve lower bound for nxn mesh 
            of Omega(n^3/sqrt(M)) words moved by Cholesky with nested dissection ordering, 
            if properly implemented

         Contrast: band solver would cost O(dimension*bw^2) = n^(2+2) = n^4 flops

          (see examples from CS267 Lecture 13, 
           www.cs.berkeley.edu/~demmel/cs267_Spr09/lecture13_partition_jwd09.ppt)

          It's a good idea for 3D meshes too: n x n x n mesh (so matrix is n^3 by n^3) costs n^6 to factor
          using Cholesky with nested dissection ordering (as opposed to n^9 if it were dense, or n^(3+2+2)=n^7
          if banded)

          Lots of clever algorithms and software available for partitioning graphs:
            See Lecture 13 in CS267 from Spring 09
             Packages include METIS, ParMETIS and Zoltan

         So the overall sparse Cholesky algorithm is as follows:
           Choose ordering (RCM, MD, ND)
           Build data structure to hold A and (to be computed) entries of L
           Perform factorization

         (Next sub section contains next level of detail, skip if no time)
           How do we build the data structure for L more cheaply than just performing the
           factorization to see what L's nonzero entries are? 
           Use graph theory, and assume that there is no cancellation, i.e. that once
             an entry of A gets filled in, it never cancels out later (which would
             be highly unlikely anyway).
           To motivate result, let us ask how L(i,j), i>j, could come to be nonzero.
            Recall that at elimination step k the current entry of the Schur complement
            stored at A(i,j) gets updated to A(i,j) - L(i,k)*(L^T)(k,j) = A(i,j) - L(i,k)*L(j,k)
             (1) A(i,j) could be initially nonzero. So in the graph G corresponding to A,
                 there is an edge (i,j), so a path of length 1 from i to j.
             (2) A(i,j) might be zero, but it may get filled in because there is some 
                 pivot step k, k < i and k < j, that fills it in, because 
                    A(i,k) and so L(i,k) is nonzero, and
                    A(j,k) and so L(j,k) is nonzero.
                  Thus there are edges (i,k),(k,j) in G, a path of length 2 from i to j where k < i and k < j
             (3) Next suppose A(i,j) and A(i,k) are zero, but L(i,k) is nonzero because
                 it has been previously filled in by some
                     A(i,k) = A(i,k) - L(i,m)*L(k,m) where m < i and m < k
                 where
                     A(i,m) and so L(i,m) is nonzero and
                     A(k,m) and so L(k,m) is nonzero
                 Thus there are edges (i,m),(m,k),(k,j) in G, a path of length 3 from i to j
                 with m and k < i and < j
           More generally, we get 
            Thm (George & Liu, 1981): 
                 L(i,j) is nonzero (assuming no exact cancellation) if there is a path from 
                 i to j in the graph G of A all of whose intermediate vertices k satisfy k < i and k < j
           Thm (Gilbert & Peierls)
                Cost of determining structure of L no more than actually doing factorization

   What about nonsymmetric matrices, where you need to pivot for sparsity and stability?
     Threshold pivoting: among pivot choices in column within a factor of 2 or 3 of the largest,
         pick the one with the least fill-in (so may need to change data structure on the fly)
     Static Pivoting (used in SuperLU_DIST)
         (1) Permute and scale matrix to try to make diagonal as large as possible:
          Thm: For any nonsingular A there is a permutation P and diagonal matrices D1 and D2 such that
               B = D1*A*P*D2 has B(i,i) = 1 and all other |B(i,j)| <= 1
         (2) Do LU with no pivoting; this means we can prebuild all the data structures, rather
             than figure them out on the fly, which is faster
         (3) If a prospective pivot is too small, just make it big. (This happens rarely in 
             practice.) As a result we get the LU factorization of A + a low rank change
             (the rank = the number of modified pivots), which we can fix with the 
             Sherman-Morrison-Woodbury formula or an iterative method like GMRES (to be
             discussed later).
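          The rank-1 case of this fix-up is just Sherman-Morrison; a minimal sketch of the
          correction (the matrix, vectors, and "boosted pivot" here are my own toy illustration,
          not SuperLU_DIST code):
             n = 5; A = randn(n); A(3,3) = 0;      % pretend pivot (3,3) came out too small
             u = zeros(n,1); u(3) = 1; v = u;      % boost it: we actually factor Ahat = A + u*v'
             [L,U,P] = lu(A + u*v');               % stands in for the no-pivoting factorization of Ahat
             b = randn(n,1);
             y = U \ (L \ (P*b));                  % two solves with the existing factors ...
             z = U \ (L \ (P*u));
             x = y + z*(v'*y)/(1 - v'*z);          % ... plus a Sherman-Morrison correction
             norm(A*x - b)                         % small: we solved A*x=b, not Ahat*x=b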

  Ultimate summary of algorithms for solving Poisson equation: what have we filled in so far?
     see slide 7 of Lecture "Autotuning of Common Computational Patterns" at
     parlab.eecs.berkeley.edu/2010bootcampagenda

  Survey of available sparse matrix software (linked from class web page)
    (see "Updated Survey of sparse direct linear equation solvers"
     http://crd.lbl.gov/~xiaoye/SuperLU/SparseDirectSurvey.pdf )

  Special matrices, depending only on O(n) parameters - a few examples:
       Cauchy: C(i,j) = 1/(x(i)+y(j))
      Toeplitz: T(i,j) = x(i-j),    constant along diagonals
      Hankel: H(i,j) = x(i+j),    constant along "anti-diagonals"
      Vandermonde: V(i,j) = x(j)^(i-1)
     
   Eg: Solving V^T z = b means finding z(i) so that
      sum_{i=1 to n} x(j)^(i-1)*z(i) = b(j) for each j
     i.e. finding polynomial coefficients so that the polynomial takes the values b(j) at the points x(j)
     i.e. polynomial interpolation; we know how to do this in O(n^2) time,
          e.g. using "Newton interpolation"
    Another trick works for V z = b
 
  Eg: Multiplying by Toeplitz matrix is same as convolution,
      which we know how to do fast by FFT
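   A minimal sketch of such a Toeplitz matrix-vector multiply in O(n log n) via the FFT, by
   embedding T in a 2n-by-2n circulant (the construction and names are my own):
      n = 6;
      c = randn(n,1); r = [c(1); randn(n-1,1)];    % first column and first row of T
      T = toeplitz(c, r);                          % dense T, only used to check the answer
      x = randn(n,1);
      cc = [c; 0; r(end:-1:2)];                    % first column of the circulant embedding
      y = ifft( fft(cc) .* fft([x; zeros(n,1)]) ); % circular convolution via FFT
      y = real(y(1:n));                            % first n entries give T*x
      norm(y - T*x)                                % ~ macheps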

  Eg: Solving with a Cauchy, also interpretable as interpolation

  Common structure: A*X + X*B = low rank for some simple A and B
  Def: Called "displacement rank"
  Ex: V*diag(x_i) = "V shifted up"
      P * V = [ 0 1 0 ... 0 ] * V = "V shifted up", 
              [ 0 0 1 ... 0 ]
                     ...
              [ 0 0 0 ... 1 ]
              [ 1 0 0 ... 0 ]
      difference  = nonzero in last row only = rank 1

  Ex: P*T - T*P = T shifted up - T shifted right
                = nonzero in first col, last row = rank 2

  Ex: diag(x_i)*C + C*diag(y_i) = all ones = rank 1

  Theory (Kailath et al): there are O(n^2) solvers if displacement rank low
      (stability can be a problem)

Final topic in Ax=b: How to get a more accurate answer if the condition number is large

   Recall that the error bound is roughly
      ||x_true - x_computed|| / ||x_true|| <= cond(A) * pivot_growth * O(macheps)
  where we have put dependence on dimension inside the O().  So 
     (1) how do we know if this is too large?
           Answer: can estimate cond(A) with O(n^2) extra work after doing GEPP;
                   pivot_growth also costs just O(n^2), by computing ||U||/||A||
          (Plot expected error vs condition number on log scale)
     (2) what do we do about it?
          Answer 1: Do everything in higher precision - expensive!
          Answer 2: Do just O(n^2) extra work in higher precision:
                     iterative refinement (much cheaper)

    Basically just Newton's method:

       Do GEPP to solve Ax=b, call initial solution x(1)
       i = 1
       repeat
         r = A*x(i) - b    ... in double precision, but just O(n^2)
         solve A*d = r     ... using existing LU factors, just O(n^2)
         update x(i+1) = x(i) - d  
       until "convergence"

       why compute r in double precision? 
           otherwise the computed residual r is mostly rounding noise (though refinement in working precision still gives some benefit)
       testing "convergence" tricky: 
          how can we avoid getting fooled by very ill-conditioned matrix
            that "accidentally" converges?
          see www.netlib.org/lapack/lawnspdf/lawn165.pdf for details
       see figures 33-35 of www.cs.berkeley.edu/~demmel/Future_ScaLAPACK_v7.ppt for results
       Available in LAPACK as sgesvxx, dgesvxx, etc
        interesting for a different reason on the Cell, GPUs, and similar platforms:
          single precision *much* faster than double precision (sometimes 10x)
          so to get the "usual" accuracy in double
             do LU in single
             run iterative refinement, only computing r in double
             convergence test is simply "as good as running GEPP in double
               without refinement", i.e.
                    norm(A*x-b) = O( double macheps )*norm(A)*norm(x)
          see www.netlib.org/lapack/lawnspdf/lawn177.pdf for details (on Cell processor)