Notes for Math 221, Lecture 11, Sep 30 2010

Goal: Variations of dense LU that minimize communication

A little history first: In the beginning was the do-loop. This was enough for
the first libraries (EISPACK, for eigenvalues of dense matrices, mid 1960s).
People didn't worry about data motion; arithmetic was more expensive, and they
mostly just tried to get the answer reliably.

BLAS-1 library (mid 1970s) - Basic Linear Algebra Subroutines
  Standard library of 15 operations (mostly) on vectors:
  (1) y = alpha*x + y, x and y vectors, alpha scalar ("AXPY" for short)
      Ex: innermost loop in GE: for k=i+1:n, A(j,k) = A(j,k) - A(j,i)*A(i,k)
  (2) dot product
  (3) 2-norm(x) = sqrt( sum_i x(i)^2 )
  (4) find the largest entry in absolute value in a vector (for pivoting in GE)
  Motivations at the time:
    commonly used functions, to ease programming and improve readability
    robustness (eg avoid over/underflow in 2-norm(x))
    portability and efficiency (if optimized on each architecture)
  But there is no way to minimize communication in such simple operations,
  because they do only a few flops per memory reference
  (eg 2n flops on 2n data for a dot product, so 1 flop per memory reference).
  So if we implement matmul or GE or anything else by loops calling AXPY,
  it will communicate a lot: #words_moved = O(n^3), independent of cache size.

BLAS-2 library (mid 1980s)
  Standard library of 25 operations (mostly) on matrix-vector pairs:
  (1) y = beta*y + alpha*A*x ("GEMV" for short), with lots of variations for
      different matrix structures (symmetric, banded, triangular, etc);
      A^T or A^* may be used instead of A; obvious optimizations
      when alpha and/or beta is +1, -1 or 0
  (2) A = A + alpha*x*y^T, matrix + rank-1 update ("GER" for short)
      Ex: 2 innermost loops in GE:
          A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)
  (3) solve T*x=b where T is triangular (used by GE to solve A*x=b)
      ("TRSV" for short)
  Similar motivation as for BLAS-1, plus more opportunities to optimize on the
  vector computers of the day. But there is still not much reduction in
  communication, still just a few flops per memory reference: for example,
  GEMV reads n^2+2n+2 words, writes n words, and does 2n^2+3n flops,
  for a flop-to-word ratio of about 2.

BLAS-3 library (late 1980s)
  Standard library of 9 operations on matrix-matrix pairs:
  (1) C = beta*C + alpha*A*B ("GEMM" for short)
  (2) C = beta*C + alpha*A*A^T, C symmetric ("SYRK" for short)
  (3) solve T*X=B where T is triangular and X, B are matrices ("TRSM" for short)
  Finally these permit the communication optimizations we discussed before
  (see the sketch below comparing the three BLAS levels on the GE update).
  So if we use BLAS-3 to build GE (and other routines), we can also hope to
  move less data. It is not obvious how to do this; we will discuss it in
  more detail below.

Note: BLAS-k does O(n^k) operations, for k=1,2,3.
Note: These are supplied as part of the optimized math libraries on
essentially all computers: use them as your building blocks!
Reference implementations and documentation at www.netlib.org/blas

Current state of the art:

LU on sequential machines: LAPACK's "blocked" LU decomposition with partial
pivoting spends nearly all its time doing matrix multiplication when n is
large, so it runs fast. But it still does not minimize #words_moved for all
combinations of dimension n and cache size M. There is a recursive LU
(analogous to recursive matmul) that does minimize #words_moved, but again
not #messages. To minimize #messages, we need a different pivoting scheme,
called "tournament pivoting", to replace partial pivoting, combined with the
above techniques.
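To make the contrast between the BLAS levels concrete, here is a Matlab-style
sketch (mine, not from any library) of the same GE trailing-matrix update
written at each level; i, b and the n x n matrix A are assumed to come from
the surrounding GE loop:

  % GE trailing-matrix update at each BLAS level (illustrative sketch only).

  % BLAS1: one AXPY per row j -- O(1) flops per word moved
  for j = i+1:n
    A(j, i+1:n) = A(j, i+1:n) - A(j,i) * A(i, i+1:n);
  end

  % BLAS2: the same update as a single rank-1 update (GER) --
  % still O(1) flops per word moved
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n);

  % BLAS3: defer b such updates and apply them as one matmul (GEMM) --
  % up to O(sqrt(M)) flops per word moved once blocks fit in fast memory
  A(i+b:n, i+b:n) = A(i+b:n, i+b:n) - A(i+b:n, i:i+b-1) * A(i:i+b-1, i+b:n);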
LU on parallel machines: ScaLAPACK's "blocked" LU also spends nearly all its
time doing matmuls locally on each processor (like Cannon's matrix
multiplication), and comes within a factor of O(log P) of minimizing
#words_moved. But it sends about n*log(P) messages altogether, many more than
the lower bound of Omega(sqrt(P)). Highly tuned versions of parallel LU are
used to rank the 500 fastest computers in the world every 6 months
(www.top500.org), so anything that makes LU faster is very visible!
CALU = Communication-Avoiding LU uses tournament pivoting to minimize both
#words_moved and #messages (modulo factors like log(P)).

Possible class projects: We have proposals for ways to accommodate other
pivoting schemes while minimizing communication, in particular:
  LU with complete pivoting - searching for the largest matrix entry at each
    step of LU takes too much communication, so search for the best block instead
  LDL' with pivoting - pivoting needs to preserve symmetry, but can't just
    look down the diagonal (its entries could all be zero)
  QR with column pivoting (topic of Chap 3)
Topics:
  Implement (Matlab ok) and do numerical experiments on carefully chosen test
    matrices to evaluate stability compared to the usual algorithm
  Write down a performance model to count #flops, #words_moved, #messages
  Implement a "fast" version, and compare its performance to fast versions of
    LU with simpler pivoting (i.e. how much more does careful pivoting cost?)
All these routines count on the BLAS library to supply very fast matrix
multiplication and other routines for each processor.

Recall the existing GEPP on an m x n matrix: it is a BLAS2 code:

  for i = 1 to n-1
    pivot so |A(i,i)| is largest among |A(i:m,i)|
    j = i+1
    A(j:m,i) = A(j:m,i)/A(i,i)                    ... BLAS1 - scale a vector
    A(j:m,j:n) = A(j:m,j:n) - A(j:m,i)*A(i,j:n)   ... BLAS2 - rank-1 update

(picture)

General technique to convert BLAS2 code to BLAS3 code (often works):
Note that a sequence of rank-1 updates A = A - u(i)*v(i)^T can also be
written as A = A - U*V^T where U = [...u(i)...] and V = [...v(i)...],
i.e. as gemm. Question: in GEPP, each BLAS2 call appears to depend on the
previous one, to get the vectors A(j:m,i) = u(i) and A(i,j:n) = v(i)^T,
so how do we reorganize GEPP to get several at a time?

Getting started: write

  ( A11 A12 ) = ( L11  0 ) * ( U11 U12 )
  ( A21 A22 )   ( L21  I )   (  0   S  )

where A11, L11, U11 are b x b, and S is the Schur complement.
Multiplying this out tells us how to compute block entries of L and U:

  ( A11 A12 ) = ( L11  0 ) * ( U11 U12 ) = ( L11*U11   L11*U12   )
  ( A21 A22 )   ( L21  I )   (  0   S  )   ( L21*U11   S+L21*U12 )

  => ( A11 ) = ( L11 ) * U11
     ( A21 )   ( L21 )

This is the LU factorization of the leading b columns of A, for which we use
the BLAS2 algorithm presented before:

  for i = 1:b
    pivot
    L(i+1:n,i) = A(i+1:n,i)/A(i,i)   ... L overwrites A
    U(i,i:b) = A(i,i:b)              ... U overwrites A (nothing to do)
    A(i+1:n,i+1:b) = A(i+1:n,i+1:b) - L(i+1:n,i)*U(i,i+1:b)

The #flops = O(n*b^2), from the BLAS2 (rank-1 update) operation. If b is
small enough that the n x b block fits in cache (n*b <= M), then it also
minimizes communication (this is an assumption that does not always hold!).
To proceed, we equate the (1,2) blocks to get A12 = L11*U12, so
U12 = inv(L11)*A12, which is a triangular solve with many right-hand sides
(BLAS3, TRSM), doing O(n*b^2) flops (see your homework!). Finally we equate
the (2,2) blocks to get S = A22 - L21*U12, which is matrix multiplication
(gemm) with O(n^2*b) flops. Most of the flops are done in this step, using
BLAS3.
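Here is a minimal Matlab/Octave sketch of one such block step, with pivoting
omitted for brevity (so it illustrates the data flow, not LAPACK's actual
code):

  % One block step of LU on an n x n matrix A, block size b, pivoting omitted.
  b = 64;  n = size(A,1);        % b chosen so the n x b panel fits in cache
  % (1) BLAS2 LU of the leading b columns: A(:,1:b) = [L11; L21]*U11
  for i = 1:b
    A(i+1:n, i) = A(i+1:n, i) / A(i,i);                           % BLAS1 scale
    A(i+1:n, i+1:b) = A(i+1:n, i+1:b) - A(i+1:n, i)*A(i, i+1:b);  % rank-1 update
  end
  L11 = tril(A(1:b, 1:b), -1) + eye(b);
  % (2) equate (1,2) blocks: U12 = inv(L11)*A12, a TRSM (BLAS3)
  A(1:b, b+1:n) = L11 \ A(1:b, b+1:n);
  % (3) equate (2,2) blocks: S = A22 - L21*U12, a GEMM (BLAS3),
  %     where most of the flops are done
  A(b+1:n, b+1:n) = A(b+1:n, b+1:n) - A(b+1:n, 1:b) * A(1:b, b+1:n);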
Putting these steps in a loop over block columns gives the blocked (BLAS3)
GEPP:

  for i = 1 to n-1 step b
    Factor A(i:n, i:i+b-1) = ( L22 ) * U22 using the BLAS2 algorithm
                             ( L32 )
    Apply the pivot interchanges to the rest of the matrix
    A(i:i+b-1, i+b:n) = inv(A(i:i+b-1, i:i+b-1)) * A(i:i+b-1, i+b:n)
        ... U23 = inv(L22)*A23, a single BLAS3 call (TRSM)
    A(i+b:n, i+b:n) = A(i+b:n, i+b:n) - A(i+b:n, i:i+b-1) * A(i:i+b-1, i+b:n)
        ... S = A33 - L32*U23, a single BLAS3 call (GEMM)

This algorithm is the basis of the LAPACK routine getrf. As n grows, and if b
is small compared to n, the fraction of arithmetic in the call to matmul to
update S approaches 100%, so this algorithm is usually quite fast. But
analysis shows that for some sizes of n and M it is impossible to pick b to
reach the lower bound of #words_moved = Omega(n^3/sqrt(M)); the worst case is
the relatively small problem with n = M^(2/3), where
#words_moved = O(n^3/M^(1/3)). Since this only happens for smaller n, and is
too large by a factor of only M^(1/6), it was not noticed for a long time.
See Lecture 4 of the short course at
www.ba.cnr.it/ISSNLA2010/Course_Demmel.htm for details.

Just as there was a recursive version of matmul that minimized communication
without explicitly depending on the cache size M, we can do the same for LU
(due to Toledo, 1997). At the top level:

  Do LU on the left half of the matrix
  Update the right half (U at top, Schur complement at bottom)
  Do LU on the Schur complement

  function [L,U] = RLU(A)  ... assume A is n by m with n >= m, m a power of 2
  if m=1
    ... one column: pivot so the largest entry is on the diagonal,
    ... then update the rest of the matrix
    L = A/A(1), U = A(1)
  else
    ... write A = [ A11 A12 ],  L1 = [ L11 ]
    ...           [ A21 A22 ]        [ L21 ]
    ... where A11, A12, L11, U1 and U2 are m/2 by m/2
    ... and A21, A22 and L21 are n-m/2 by m/2
    [L1,U1] = RLU( [ A11 ] )  ... LU of the left half of A
                   [ A21 ]
    A12 = L11 \ A12           ... update U12, in the upper part of the right half of A
    A22 = A22 - L21*A12       ... update the Schur complement
    [L2,U2] = RLU(A22)        ... LU on the Schur complement
    L = [ L1, [0;L2] ]        ... return the complete n x m L factor
    U = [ U1 A12 ]            ... return the complete m x m U factor
        [ 0  U2 ]

Correct by induction on m. Analysis: see homework. This algorithm can do
partial pivoting, but it only minimizes #words_moved, not #messages.
Minimizing #messages requires tournament pivoting instead of partial pivoting
(see below; a runnable sketch of RLU follows).
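For completeness, here is a runnable Matlab/Octave version of RLU; the
explicit permutation handling (the returned vector p with L*U = A(p,:)) is my
addition, since the pseudocode above suppresses it:

  function [L,U,p] = rlu(A)
  % Recursive LU with partial pivoting of an n x m matrix A, n >= m,
  % m a power of 2.  Returns L (n x m, unit lower triangular on top),
  % U (m x m, upper triangular) and a permutation vector p with
  % L*U = A(p,:).  A sketch for illustration, not an optimized code.
  [n,m] = size(A);
  p = (1:n)';
  if m == 1
    [mx,k] = max(abs(A));                 % pivot: largest entry in the column
    A([1 k]) = A([k 1]);  p([1 k]) = p([k 1]);
    L = A/A(1);  U = A(1);
  else
    h = m/2;
    [L1,U1,p1] = rlu(A(:, 1:h));          % LU of the left half of A
    A = A(p1, :);  p = p(p1);             % apply its pivoting to all of A
    A12 = L1(1:h, :) \ A(1:h, h+1:m);     % U12 = inv(L11)*A12  (TRSM)
    S = A(h+1:n, h+1:m) - L1(h+1:n, :)*A12;   % Schur complement (GEMM)
    [L2,U2,p2] = rlu(S);                  % LU of the Schur complement
    q = [(1:h)'; h + p2];                 % fold in second-level pivoting
    L = [L1(q,:), [zeros(h, m-h); L2]];
    U = [U1, A12; zeros(m-h, h), U2];
    p = p(q);
  end

A quick check: after A = randn(8,4); [L,U,p] = rlu(A); the residual
norm(L*U - A(p,:)) should be at roundoff level.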
Fact: if we do L21*A12 by Strassen, and L11\A12 by
(1) inverting L11 by divide-and-conquer:

      inv( [ T11 T12 ] ) = [ inv(T11)  -inv(T11)*T12*inv(T22) ]
         ( [  0  T22 ] )   [    0              inv(T22)       ]

(2) multiplying inv(L11)*A12 etc by Strassen,
then the RLU algorithm costs O(n^(log2 7)) like Strassen, but can be slightly
less stable than the usual O(n^3) version of GEPP
(see arxiv.org/abs/math.NA/0612264).

Last GE implementation: Parallel GE
(Block cyclic layout slides - slide 27 from Lecture 14 of CS267 Spr 10)
(Parallel GE slides - slides 28-29 from Lecture 14 of CS267 Spr 10)
(www.cs.berkeley.edu/~demmel/cs267_Spr10)
However, this is not quite as good as Cannon's MatMul algorithm:
  #flops per processor: about 2n^3/(3P) - ok, each processor does 1/P-th of the work
  #words_moved per processor: O(n^2*log(P)/sqrt(P)) - larger than the lower
    bound, but only by a factor of log(P), which grows slowly
  #messages: O(n*log(P)) - can be much larger than the lower bound of Omega(sqrt(P))
It seems impossible to preserve partial pivoting and do better: we need to
compute the largest entry in each column, so #messages grows like Omega(n).

An alternative to partial pivoting that (nearly) reaches the lower bound,
gets interesting speedups, and is "almost" as stable is "tournament pivoting"
(see the papers by D., Grigori and Xiang at bebop.cs.berkeley.edu, or
Grigori, D., Xiang (SC08), http://hal.inria.fr/inria-00277901/fr/).

Goal of tournament pivoting: given an n x b block spread over p processors,
with each processor getting an n/p x b subblock, choose b pivot rows with as
few messages as possible. Example with p=8, where each local factorization
uses partial pivoting:

  A = [ A1 ] = [ P1*L1*U1 ]   P1 chooses the b "best rows" of A1; call them B1
      [ A2 ]   [ P2*L2*U2 ]   P2 chooses the b "best rows" of A2; call them B2
      [ A3 ]   [ P3*L3*U3 ]   ditto; call them B3
      [ A4 ]   [ P4*L4*U4 ]   ditto; call them B4
      [ A5 ]   [ P5*L5*U5 ]   ditto; call them B5
      [ A6 ]   [ P6*L6*U6 ]   ditto; call them B6
      [ A7 ]   [ P7*L7*U7 ]   ditto; call them B7
      [ A8 ]   [ P8*L8*U8 ]   ditto; call them B8

  [ B1 ] = P12*L12*U12   P12 chooses the b "best rows" of B1 and B2; call them C1
  [ B2 ]

  [ B3 ] = P32*L32*U32   P32 chooses the b "best rows" of B3 and B4; call them C2
  [ B4 ]

  [ B5 ] = P52*L52*U52   P52 chooses the b "best rows" of B5 and B6; call them C3
  [ B6 ]

  [ B7 ] = P72*L72*U72   P72 chooses the b "best rows" of B7 and B8; call them C4
  [ B8 ]

  [ C1 ] = P13*L13*U13   P13 chooses the b "best rows" of C1 and C2; call them D1
  [ C2 ]

  [ C3 ] = P33*L33*U33   P33 chooses the b "best rows" of C3 and C4; call them D2
  [ C4 ]

  [ D1 ] = P14*L14*U14   P14 chooses the b "best rows" of D1 and D2; call them E2
  [ D2 ]

E2 consists of the b "best rows" of the original n x b block column A, so
(1) pivot these rows to the top of A, and
(2) perform b steps of LU with no pivoting.

Theorem (D., Grigori): Tournament pivoting is "almost" as stable as partial
pivoting in the following senses:
(1) Tournament pivoting gives the same pivot growth factor ||U||/||A|| as
    partial pivoting applied to a larger matrix constructed from blocks of A
    (so if you trusted PP, why not TP?).
(2) In the theoretical worst case, the pivot growth factor for TP is
    2^((H+1)*n - 1), where H = the height of the tree used (see above),
    compared to 2^(n-1) for PP; we have not seen TP do worse than 2^(n-1).
Possible class project: search for instabilities in TP!
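A Matlab/Octave sketch of the tournament, shown as one listing for brevity
(the function would live in its own file in Matlab; names and the serial
driver loop are mine, and on a real machine each pass of the while-loop is
one round of messages, so log2(p) rounds in all):

  function C = tournament_step(B1, B2)
  % One "game" of the tournament: GEPP on the stacked candidates [B1; B2]
  % (each b x b) picks the b winning rows.
  B = [B1; B2];
  [Ltmp, Utmp, piv] = lu(B, 'vector');   % partial pivoting on the 2b x b stack
  b = size(B, 2);
  C = B(piv(1:b), :);                    % the b pivot rows win this round

  % Serial simulation of the reduction tree: blocks{i} holds the b best
  % rows found by processor i (the Bi above); numel(blocks) a power of 2.
  while numel(blocks) > 1
    for i = 1:numel(blocks)/2
      blocks{i} = tournament_step(blocks{2*i-1}, blocks{2*i});
    end
    blocks = blocks(1:numel(blocks)/2);
  end
  % blocks{1} now holds the b rows to pivot to the top (E2 in the example).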