Notes for Math 221, Lecture 11, Sep 30 2010

Goal: Variations of dense LU that minimize communication

  A little history first:
       In the beginning was the do-loop. This was enough for the first
       libraries (EISPACK, for eigenvalues of dense matrices, mid 1960s).
       People didn't worry about data motion, since arithmetic was more
       expensive; they mostly just tried to get the answer reliably.

       BLAS-1 library (mid 1970s) - Basic Linear Algebra Subroutines
            Standard library of 15 operations (mostly) on vectors:
              (1)  y = alpha*x + y, x and y vectors, alpha scalar ("AXPY" for short)
                   Ex: innermost loop in GE:
                          for k=i+1:n, A(j,k) = A(j,k) - A(j,i)*A(i,k)
              (2)  dot product
              (3)  2-norm(x) = sqrt( sum_i x(i)^2 )
              (4)  find largest entry in absolute value in a vector (for pivoting in GE)
            Motivations at the time:
               Commonly used functions to ease programming, readability
               robustness (e.g. avoiding over/underflow in 2-norm(x))
               portable and efficient (if optimized on each architecture)
            But there is no way to minimize communication in such simple operations, 
            because they do only a few flops per memory reference.
            (e.g. 2n flops on 2n data for a dot product, so 1 flop per memory reference).
            So if we implement matmul or GE or anything else
            by loops calling AXPY, it will communicate a lot, #words_moved = O(n^3) ,
            independent of cache size.
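
        As an illustration (a minimal NumPy sketch, not the Fortran BLAS
        interface; the axpy function here is hypothetical), the GE inner
        loop above written as AXPY calls:

            import numpy as np

            def axpy(alpha, x, y):
                # BLAS-1 AXPY: y = alpha*x + y (2n flops on about 3n words)
                y += alpha * x

            rng = np.random.default_rng(0)
            n = 4
            A = rng.standard_normal((n, n)) + n * np.eye(n)  # safe without pivoting
            i = 0
            for j in range(i + 1, n):           # eliminate column i below the pivot
                A[j, i] /= A[i, i]              # multiplier L(j,i)
                axpy(-A[j, i], A[i, i+1:], A[j, i+1:])  # A(j,k) -= A(j,i)*A(i,k)

        Each axpy call does O(n) flops on O(n) words, so no blocking for
        cache can help at this level.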

        BLAS-2 library (mid 1980s)
           Standard library of 25 operations (mostly) on matrix-vector pairs
              (1) y = alpha*y + beta*A*x ("GEMV" for short),
                      lots of variations for different matrix structures
                            (symmetric, banded, triangular etc)
                      allow A^T, A^* to be used instead of A,
                      obvious optimizations when alpha and/or beta is +1, -1 or 0
              (2) A = A + alpha*x*y^T (matrix + rank-1 update) ("GER" for short)
                  Ex:  2 innermost loops in GE:
                         A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)
              (3) Solve T*x=b where T triangular (used by GE to solve A*x=b)
                     ("TRSV" for short)
           Similar motivation as for BLAS-1, plus more opportunities to optimize
           on the vector computers of the day. But there is still not much reduction
           in communication, still just a few flops per memory reference:
           for example, GEMV reads n^2+2*n+2 words, writes n words,
           and does 2*n^2+3*n flops, for a flop-to-words_moved ratio of about 2.
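
           As a sketch (plain NumPy standing in for the Fortran BLAS; names
           and sizes here are illustrative), GEMV and GER as written above:

               import numpy as np

               rng = np.random.default_rng(1)
               n = 5
               A = rng.standard_normal((n, n))
               x = rng.standard_normal(n)
               y = rng.standard_normal(n)
               alpha, beta = 2.0, -1.0

               # GEMV: y = alpha*y + beta*A*x (about 2n^2 flops on n^2 words)
               y = alpha * y + beta * (A @ x)

               # GER: A = A + alpha*x*y^T, the rank-1 update that forms
               # the 2 innermost loops of GE
               A += alpha * np.outer(x, y)

           Still only about 2 flops per word of A touched.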
              
        BLAS-3 library (late 1980s)
           Standard library of 9 operations on matrix-matrix pairs
              (1) C = beta*C + alpha*A*B ("GEMM" for short)
              (2) C = beta*C + alpha*A*A^T, C symmetric ("SYRK" for short)
              (3) Solve T*X=B where T triangular, X and B matrices ("TRSM" for short)
           Finally these permit the communication optimizations we discussed before
           So if we use BLAS-3 to build GE (and other routines), 
           we can also hope to move less data.
           Not obvious how to do this: we will discuss this in more detail.
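
            As an illustrative NumPy/SciPy sketch of the three operations
            (the real BLAS routines are the Fortran GEMM/SYRK/TRSM; the
            Python below just shows what they compute):

                import numpy as np
                from scipy.linalg import solve_triangular

                rng = np.random.default_rng(2)
                n = 6
                A = rng.standard_normal((n, n))
                B = rng.standard_normal((n, n))
                C = rng.standard_normal((n, n))
                alpha, beta = 1.5, 0.5

                # GEMM: C = beta*C + alpha*A*B (2n^3 flops on O(n^2) words)
                C = beta * C + alpha * (A @ B)

                # SYRK: C = beta*C + alpha*A*A^T, with C symmetric
                Csym = C + C.T
                Csym = beta * Csym + alpha * (A @ A.T)

                # TRSM: solve T*X = B, T triangular, many right-hand sides
                T = np.tril(A) + n * np.eye(n)   # well-conditioned triangular T
                X = solve_triangular(T, B, lower=True)

            The O(n^3)-flops-on-O(n^2)-data ratio is what makes cache
            blocking pay off.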

      Note: BLAS-k does O(n^k) operations, for k=1,2,3
      Note: These are supplied as part of the optimized math libraries on
               essentially all computers: use them as your building blocks!
            Reference implementations and documentation at www.netlib.org/blas

   Current state of the art:
      LU on sequential machines:
        LAPACK's "blocked" LU decomposition with partial pivoting spends nearly all 
          its time doing matrix multiplication when n is large, so it runs fast.
          But it still does not minimize #words_moved for all combinations
          of dimension n and cache size M.
        There is a recursive LU (analogous to recursive matmul) that does
          minimize #words_moved, but again not #messages.
        To minimize #messages, we need a different pivoting scheme, called
          "tournament pivoting", to replace "partial pivoting" combined with 
          above techniques.
      LU on parallel machines:
        ScaLAPACK's "blocked" LU also spends nearly all its time doing
          matmuls locally on each processor (like Cannon's matrix multiplication),
          and comes within a factor of O(log P) of minimizing #words_moved.
          But it sends many more messages than the lower bound of Omega(sqrt(P)):
          about n*log(P) messages altogether.
          Highly tuned versions of parallel LU are used to rank the 500 fastest
          computers in the world every 6 months (www.top500.org), so anything
          that makes LU faster is very visible!
        CALU = Communication-Avoiding LU uses tournament pivoting to minimize
          both #words_moved and #messages (modulo factors like log(P)).

   Possible class projects: We have proposals for ways to accommodate other pivoting
   schemes while minimizing communication, in particular
     LU with complete pivoting - too much communication to search for largest 
       matrix entry at each step of LU, so search for best block instead
     LDL' with pivoting - pivoting needs to preserve symmetry, but can't just
       look down diagonal (could all be zero)
     QR with column pivoting (topic of Chap 3)
   Topics:
     Implement (Matlab ok) and do numerical experiments on carefully chosen
       test matrices to evaluate stability compared to usual algorithm
     Write down performance model to count #flops, #words_moved, #messages
     Implement a "fast" version, compare performance to fast versions of LU
       with simpler pivoting (i.e. how much more does careful pivoting cost?)
       
   All these routines count on the BLAS library to supply very fast
   matrix-multiplication and other routines for each processor.

   Recall the existing GEPP on an m x n matrix: it is a BLAS-2 code
   
     for i = 1 to n-1
        pivot so |A(i,i)| largest among |A(i:m,i)|
        j=i+1
        A(j:m,i) = A(j:m,i)/A(i,i)   ... BLAS 1 - scale a vector
        A(j:m,j:n) = A(j:m,j:n) - A(j:m,i)*A(i,j:n)   ... BLAS2 - rank-1 update

   picture
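
   A runnable NumPy version of this BLAS-2 code (a sketch for experiments;
   LAPACK's getf2 is the real thing, and gepp is a name chosen here):

       import numpy as np

       def gepp(A):
           # LU with partial pivoting on an m x n matrix (m >= n).
           # Returns LU (L below the diagonal, U on and above it) and the
           # row permutation piv such that A[piv] = L*U.
           A = A.copy()
           m, n = A.shape
           piv = np.arange(m)
           for i in range(n):
               p = i + np.argmax(np.abs(A[i:, i]))  # pivot search (BLAS-1 IAMAX)
               A[[i, p]] = A[[p, i]]                # swap rows i and p
               piv[[i, p]] = piv[[p, i]]
               A[i+1:, i] /= A[i, i]                # BLAS-1: scale a vector
               A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])  # BLAS-2: GER
           return A, piv

       rng = np.random.default_rng(3)
       A = rng.standard_normal((6, 6))
       LU, piv = gepp(A)
       L = np.tril(LU, -1) + np.eye(6)
       U = np.triu(LU)
       assert np.allclose(L @ U, A[piv])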

   General technique to convert BLAS2 code to BLAS3 code (often works):
       Note that a sequence of rank-1 updates A = A - u(i)*v(i)^T
       can also be written as A = A - U*V^T, where U = [...u(i)...] and
       V = [...v(i)...] have the u(i) and v(i) as columns, i.e. as gemm.
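    A quick NumPy check of this identity (illustration only):

        import numpy as np

        rng = np.random.default_rng(4)
        n, b = 6, 3
        A = rng.standard_normal((n, n))
        U = rng.standard_normal((n, b))   # columns u(1),...,u(b)
        V = rng.standard_normal((n, b))   # columns v(1),...,v(b)

        A1 = A.copy()
        for i in range(b):                # b BLAS-2 rank-1 updates (GER)
            A1 -= np.outer(U[:, i], V[:, i])

        A2 = A - U @ V.T                  # the same update as one GEMM
        assert np.allclose(A1, A2)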
   Question: in GEPP, each BLAS2 call appears to depend on the previous one,
       to get the vectors A(j:m,i) = u(i) and A(i,j:n) = v(i)^T, so
       how do we reorganize GEPP to get several at a time?


   Getting started: write
      ( A11 A12 ) = ( L11 0 ) * ( U11 U12 )
      ( A21 A22 )   ( L21 I )   (  0  S   )

      where  A11, L11, U11 are bxb, and S is the Schur Complement.
      Multiplying this out tells us how to compute block entries of L and U:

      ( A11 A12 ) = ( L11 0 ) * ( U11 U12 ) = ( L11*U11 L11*U12 )
      ( A21 A22 )   ( L21 I )   (  0  S   )   ( L21*U11 S+L21*U12)

       => ( A11 ) = ( L11 ) * U11
          ( A21 )   ( L21 )
        This is the LU factorization of the leading b columns of A, for which 
        we use the BLAS2 algorithm presented before:

             for i=1:b
               pivot
               L(i+1:n,i) = A(i+1:n,i)/A(i,i)   ... L overwrites A
               U(i,i:b) = A(i,i:b)              ... U overwrites A (nothing to do)
               A(i+1:n,i+1:b) -= L(i+1:n,i)*U(i,i+1:b)

         The #flops = O(n b^2): each of the b steps does a rank-1 (BLAS2) update
         of a block of size about n x b. If b is small enough that the n x b
         panel fits in cache (n*b <= M), then this step also minimizes
         communication (an assumption that does not always hold!).

        To proceed, we equate the (1,2) blocks to get
           A12 = L11*U12 => U12 = inv(L11)*A12
         which is a triangular solve with many right-hand sides (BLAS3, TRSM), doing
        O(n b^2) flops (see your homework!).

        Finally we equate the (2,2) blocks to get
        S = A22 - L21*U12,   
        which is matrix multiplication (gemm) with O(n^2 b) flops.
        Most of the flops are done in this step, using BLAS3.

  for i = 1 to n-1 step b

       Factor A(i:n,i:i+b-1)  = ( L22 ) * U22 using BLAS2 algorithm
                                ( L32 )

       Apply pivot interchanges to rest of matrix

        A(i:i+b-1, i+b:n) = inv(L22) * A(i:i+b-1, i+b:n)
               ... U23 = inv(L22)*A23 , a single BLAS3 (TRSM) call; L22 is the unit
               ... lower triangular factor stored in A(i:i+b-1, i:i+b-1)

       A(i+b:n, i+b:n) = A(i+b:n, i+b:n) - A(i+b:n,i:i+b-1) * A(i:i+b-1,i+b:n)
               ... S = A33 - L32*U23 , a single BLAS3 (GEMM) call
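
   Putting the three steps together, a minimal NumPy sketch of this blocked
   algorithm (a toy version of what getrf does; blocked_lu and the default
   block size are names and choices made here):

       import numpy as np
       from scipy.linalg import solve_triangular

       def blocked_lu(A, b=2):
           A = A.copy()
           n = A.shape[0]
           piv = np.arange(n)
           for i in range(0, n, b):
               e = min(i + b, n)
               # factor the panel A(i:n, i:e) with the BLAS-2 algorithm
               for k in range(i, e):
                   p = k + np.argmax(np.abs(A[k:, k]))
                   A[[k, p]] = A[[p, k]]     # interchange applied to whole row
                   piv[[k, p]] = piv[[p, k]]
                   A[k+1:, k] /= A[k, k]
                   A[k+1:, k+1:e] -= np.outer(A[k+1:, k], A[k, k+1:e])
               if e < n:
                   # U23 = inv(L22)*A23: one TRSM call
                   L22 = np.tril(A[i:e, i:e], -1) + np.eye(e - i)
                   A[i:e, e:] = solve_triangular(L22, A[i:e, e:], lower=True)
                   # S = A33 - L32*U23: one GEMM call, where most flops happen
                   A[e:, e:] -= A[e:, i:e] @ A[i:e, e:]
           return A, piv

       rng = np.random.default_rng(5)
       A = rng.standard_normal((8, 8))
       LU, piv = blocked_lu(A, b=3)
       L = np.tril(LU, -1) + np.eye(8)
       U = np.triu(LU)
       assert np.allclose(L @ U, A[piv])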

  This algorithm is the basis of the LAPACK routine getrf. As n grows, and if
  b is small compared to n, then the fraction of arithmetic in the call to
  matmul to update S approaches 100%, so this algorithm is usually quite fast.
  But analysis shows that for some sizes of n and M, it is impossible to pick
  b to reach the lower bound of #words_moved = Omega(n^3/sqrt(M)); the worst case
  is the relatively small problem size n = M^(2/3), where #words_moved = O(n^3/M^(1/3)).
  Since this only happens for smaller n, and is too large by a factor of only M^(1/6), 
  it was not noticed for a long time.
  See Lecture 4 of the short course at www.ba.cnr.it/ISSNLA2010/Course_Demmel.htm
  for details.

  Just as there was a recursive version of matmul that minimized
  communication without explicitly depending on the cache size M,
  we can do the same for LU (due to Toledo, 1997):

   At top level:  "Do LU on left half of matrix
                   Update right half (U at top, Schur complement at bottom)
                   Do LU on Schur complement"

      function [L,U] = RLU(A)  
        ... assume A is n by m with n >= m, m a power of 2
        if m=1 ... one column
           pivot so largest entry on diagonal, update rest of matrix
           L = A/A(1), U = A(1)
         else
            ... write A = [ A11 A12 ], L1 = [ L11 ]
            ...           [ A21 A22 ]       [ L12 ]
            ... where A11, A12, L11, U1 and U2 are m/2 by m/2
            ...       A21, A22 and L12 are n-m/2 by m/2
            [L1,U1] = RLU([ A11 ]) ... LU of left half of A
                          [ A21 ]
            A12 = L11 \ A12      ... update U12, in upper part of right half of A
            A22 = A22 - L12*A12  ... update Schur complement
            [L2,U2] = RLU( A22 ) ... LU on Schur complement
            L = [ L1, [0;L2] ]   ... return complete n x m L factor
            U = [ U1  A12 ]      ... return complete m x m U factor
                [  0  U2  ]
           
     Correct by induction on m
      Analysis: see homework. This algorithm can do partial pivoting,
      but it only minimizes #words_moved, not #messages; minimizing #messages
      requires tournament pivoting instead of partial pivoting.
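
      A minimal NumPy sketch of the recursion, with the row-interchange
      bookkeeping that the pseudocode above leaves implicit (rlu is a name
      chosen here; any m >= 1 is handled, not just powers of 2):

          import numpy as np
          from scipy.linalg import solve_triangular

          def rlu(A):
              # A is n x m, n >= m. Returns piv, L (n x m unit lower
              # trapezoidal), U (m x m upper triangular) with A[piv] = L*U.
              A = A.copy()
              n, m = A.shape
              if m == 1:                        # base case: one column
                  p = np.argmax(np.abs(A[:, 0]))
                  A[[0, p]] = A[[p, 0]]
                  piv = np.arange(n)
                  piv[[0, p]] = piv[[p, 0]]
                  return piv, A / A[0, 0], A[:1, :1].copy()
              h = m // 2
              piv1, L1, U1 = rlu(A[:, :h])      # LU of left half of A
              A12 = A[piv1, h:]                 # permute right half the same way
              A12[:h] = solve_triangular(L1[:h], A12[:h], lower=True)  # U12
              A12[h:] -= L1[h:] @ A12[:h]       # Schur complement S
              piv2, L2, U2 = rlu(A12[h:])       # LU of Schur complement
              piv = np.concatenate([piv1[:h], piv1[h:][piv2]])
              L = np.zeros((n, m))
              L[:h, :h] = L1[:h]
              L[h:, :h] = L1[h:][piv2]          # interchanges hit L's lower part too
              L[h:, h:] = L2
              U = np.zeros((m, m))
              U[:h, :h] = U1
              U[:h, h:] = A12[:h]
              U[h:, h:] = U2
              return piv, L, U

          rng = np.random.default_rng(6)
          A = rng.standard_normal((8, 8))
          piv, L, U = rlu(A)
          assert np.allclose(L @ U, A[piv])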

     Fact: if we do L12*A12 by Strassen, and L11\A12 by 
           (1) inverting L11 by divide-and-conquer:
                inv([ T11 T12 ]) = [ inv(T11) -inv(T11)*T12*inv(T22) ]
                   ([  0  T22 ])   [    0            inv(T22)        ]
           (2) multiplying inv(L11)*A12 etc by Strassen
      then the RLU algorithm costs O(n^(log2 7)) flops like Strassen, but can be
      slightly less stable than the usual O(n^3) version of GEPP
     (see arxiv.org/abs/math.NA/0612264)
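
      A sketch of step (1), the divide-and-conquer triangular inverse
      (written for upper triangular T, matching the formula above; for the
      unit lower triangular L11, apply the same idea to L11^T; plain @
      stands in for Strassen here):

          import numpy as np

          def tri_inv(T):
              # inv([T11 T12; 0 T22]) = [inv(T11) -inv(T11)*T12*inv(T22)]
              #                         [   0             inv(T22)      ]
              n = T.shape[0]
              if n == 1:
                  return np.array([[1.0 / T[0, 0]]])
              h = n // 2
              X11, X22 = tri_inv(T[:h, :h]), tri_inv(T[h:, h:])
              X = np.zeros_like(T)
              X[:h, :h] = X11
              X[h:, h:] = X22
              X[:h, h:] = -X11 @ T[:h, h:] @ X22   # 2 products: use Strassen here
              return X

          rng = np.random.default_rng(7)
          T = np.triu(rng.standard_normal((8, 8))) + 8 * np.eye(8)
          assert np.allclose(tri_inv(T) @ T, np.eye(8))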

Last GE implementation: Parallel GE
           
     (Block cyclic layout slides - slide 27 from Lecture 14 of CS267 Spr 10)
     (Parallel GE slides - slides 28-29 from Lecture 14 of CS267 Spr 10)
     (www.cs.berkeley.edu/~demmel/cs267_Spr10)

     However, this is not quite as good as Cannon's MatMul algorithm:
        #flops per processor: about 2n^3/(3*P) - ok, each processor does 1/Pth of work
        #words_moved per processor: O(n^2*log(P)/sqrt(P)) - larger than
           lower bound, but just a factor of log(P), which grows slowly
        #messages: O(n*log(P)) - can be much larger than lower bound of Omega(sqrt(P))
     Seems impossible to preserve partial pivoting and do better:
        need to compute largest entry in each column, so #messages grows like Omega(n)
     An alternative to partial pivoting that (nearly) reaches lower bound, 
        gets interesting speedups, and is "almost" as stable, is "Tournament Pivoting"
        (see papers by D., Grigori, Xiang at bebop.cs.berkeley.edu,
         or Grigori, D., Xiang (SC08) http://hal.inria.fr/inria-00277901/fr/)

      Goal of Tournament pivoting: given an n x b block column spread over p
      processors, with each processor getting an n/p x b subblock, choose b pivot
      rows with as few messages as possible. Example with p=8, where each local
      factorization uses partial pivoting (a code sketch follows this example):

     A = [ A1 ] = [P1*L1*U1]     P1 chooses the b "best rows" of A1, call them B1
         [ A2 ]   [P2*L2*U2]     P2     "                        A2, call them B2
         [ A3 ]   [P3*L3*U3]     ditto
         [ A4 ]   [P4*L4*U4]     ditto
         [ A5 ]   [P5*L5*U5]     ditto
         [ A6 ]   [P6*L6*U6]     ditto
         [ A7 ]   [P7*L7*U7]     ditto
         [ A8 ]   [P8*L8*U8]     ditto

         [ B1 ] = P12*L12*U12    P12 chooses the b "best rows" of B1 and B2, call them C1
         [ B2 ]
         [ B3 ] = P32*L32*U32    P32 chooses the b "best rows" of B3 and B4, call them C2
         [ B4 ]
         [ B5 ] = P52*L52*U52    P52 chooses the b "best rows" of B5 and B6, call them C3
         [ B6 ]
         [ B7 ] = P72*L72*U72    P72 chooses the b "best rows" of B7 and B8, call them C4
         [ B8 ]

         [ C1 ] = P13*L13*U13    P13 chooses the b "best rows" of C1 and C2, call them D1
         [ C2 ]
         [ C3 ] = P33*L33*U33    P33 chooses the b "best rows" of C3 and C4, call them D2
         [ C4 ]

         [ D1 ] = P14*L14*U14    P14 chooses the b "best rows" of D1 and D2, call them E2
         [ D2 ]

  E2 are the b "best rows" from the original n x b block column A, so 
     (1) pivot these rows to the top of A
     (2) perform b steps of LU with no pivoting
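
   A serial NumPy sketch of the whole tournament (best_rows and
   tournament_pivot are names chosen here; each entry of cands plays the
   role of one processor's candidate set, and the messages would carry
   the b x b candidate blocks):

       import numpy as np

       def best_rows(block, b):
           # run b steps of GEPP on a copy of the block and return the
           # (local) indices of the b pivot rows it selects
           B = block.copy()
           idx = np.arange(B.shape[0])
           for k in range(b):
               p = k + np.argmax(np.abs(B[k:, k]))
               B[[k, p]] = B[[p, k]]
               idx[[k, p]] = idx[[p, k]]
               B[k+1:, k] /= B[k, k]
               B[k+1:, k+1:] -= np.outer(B[k+1:, k], B[k, k+1:])
           return idx[:b]

       def tournament_pivot(A, b, p):
           # leaves: each "processor" owns an n/p x b piece, nominates b rows
           n = A.shape[0]
           chunks = np.array_split(np.arange(n), p)
           cands = [c[best_rows(A[c], b)] for c in chunks]
           # reduction tree: pairs of candidate sets play off until b remain
           while len(cands) > 1:
               nxt = []
               for i in range(0, len(cands), 2):
                   pair = np.concatenate(cands[i:i+2])
                   nxt.append(pair[best_rows(A[pair], b)])
               cands = nxt
           return cands[0]   # global indices of the b winning rows

       rng = np.random.default_rng(8)
       n, b, p = 32, 4, 8
       A = rng.standard_normal((n, b))
       winners = tournament_pivot(A, b, p)
       # then: pivot these rows to the top of A, and do b steps of LU
       # with no pivoting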

  Theorem (D., Grigori): Tournament Pivoting is "almost" as stable as Partial Pivoting
     in the following senses:
      (1) Tournament Pivoting gives the same pivot growth factor ||U||/||A|| as
          Partial Pivoting applied to a larger matrix constructed from blocks of A
          (so if you trusted PP, why not TP?)
      (2) In the theoretical worst case, the pivot growth factor for TP is
          2^((H+1)*n-1), where H = height of the reduction tree used (see above),
          compared to 2^(n-1) for PP; TP has not been seen to do worse than 2^(n-1)

   Possible class project: search for instabilities in TP!