Notes for Math 221, Lecture 6, Sep 14 2010

Goal: Explore different ways to multiply matrices as fast as possible
      Why matmul? Easiest to understand, (nearly) all ideas extend
         to all other algorithms we study (which we won't have time to
         explain in as much detail)
      How could there be different algorithms that run at different speeds?
         (1) Two algorithms may do the same number of flops (say 2n^3 for
             matmul) but move different amounts of data between parts of the 
             computer memory, which dominates time; we want the algorithm
             that moves as little data as possible
         (2) On a parallel computer with P processors, need to give each 
             processor (1/P)th of the work, again without
             moving too much data between processors
         (3) Some matmul algorithms use O(n^3) flops, some O(n^2.81), some less...
             (world record currently O(n^2.376), but not practical, because the
              constant hidden in the O(.) is enormous)

We want a simple mathematical model of what it costs to move data,
so we can analyze algorithms easily and identify the one that is fastest.
So we need to define two terms: bandwidth and latency:

First, to establish intuition, consider the time it takes to move cars on 
the freeway from Berkeley to Sacramento:
   The bandwidth measures how many cars/hour can get from Berkeley to Sacramento:
      #cars/hour = density (#cars/mile in a lane) x velocity(miles/hour) x #lanes
   The latency measures how long it takes for one car to get from Berkeley
      to Sacramento: time(hours) = distance(miles) / velocity(miles/hour)
   So the minimum time it takes n cars to go from Berkeley to Sacramento
      is when they all travel in a single "convoy," that is all as close together
      as possible given the density, and using all lanes:
      time(hours) = time for first car to arrive + time for remaining cars to arrive
                  = latency + n/bandwidth

Now consider the similar situation of reading bits of data from disk, in which
the read head must move toward or away from the center of disk to line up 
with the circle on which the data lies, then wait for the disk to spin around
so the data is underneath the read head, and then read the data:
   The bandwidth measures how many bits/second can be read:
     #bits/second = density (#bits/cm) x velocity(cm/sec) x #read_heads
   The latency measures the (average) time to read one bit:
     time(seconds) = average distance to move read head(cm) / velocity (cm/sec)
                    + average distance to spin disk (.5 revolution) 
                       / velocity (revolutions/sec)
   So the minimum time to read n bits from disk is attained when they are all packed
   together on one circle, in which case time = latency + n/bandwidth  (on average)

Same idea (harder to explain the physics) for reading bits from main memory to 
cache (draw picture): The data resides in main memory initially, but one
can only perform arithmetic and other operations on it after moving it to the
much smaller cache. The time to read or write w words of data from memory stored
in contiguous locations (which we call a "message" instead of a convoy) is
   time = latency + w/bandwidth 

More generally, to read or write w words stored in m separate contiguous
messages costs m*latency + w/bandwidth.

Notation: We will write this cost as m*alpha + w*beta, and refer to it as the
          cost of "communication".  
          We refer to w as #words_moved and m as #messages.
          We also let gamma = time per flop, so if an algorithm does f flops
          our estimate of the total running time will be
          time = m*alpha + w*beta + f*gamma
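
          As a rough illustration, here is a tiny Python sketch of this
          timing model; the values of alpha, beta and gamma below are
          made-up placeholders (not measurements of any real machine),
          chosen only so that gamma << beta << alpha.

             # sketch of the model: time = m*alpha + w*beta + f*gamma
             def model_time(f, w, m, gamma=1e-9, beta=1e-7, alpha=1e-5):
                 """Estimated time for f flops, w words moved, m messages."""
                 return m*alpha + w*beta + f*gamma

             # example: an algorithm doing 2*n^3 flops that moves n^3 words
             # in messages of one word each (see question (1) below)
             n = 1000
             print(model_time(f=2*n**3, w=n**3, m=n**3))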

We claim that 
  (1) gamma << beta << alpha on modern machines (factors of 100s for memory,
         even bigger for disk or sending messages between processors)
  (2) It is only getting worse over time: gamma is improving ~60%/year,
          beta for memory about 23%/year, alpha for memory about 7%/year
So when we design or choose algorithms, need to pay attention to their communication costs. 

How do we tell if we have designed the algorithm well wrt not communicating too much?
If the f*gamma term in the time dominates the communication cost m*alpha + w*beta
(f*gamma >= m*alpha + w*beta) then the algorithm is at most 2x slower than
it would be if communication were free. 
When f*gamma >> m*alpha + w*beta, the algorithm is running at near peak arithmetic
speed, and can't go faster.
When f*gamma << m*alpha + w*beta, communication costs dominate.

Notation: computational intensity = q = f/w = "flops per word_moved"
This needs to be large to go fast, since f*gamma >= w*beta means
q = f/w >= beta/gamma >> 1.
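For example, with made-up but representative values gamma = 1 ns per flop and
beta = 100 ns per word (consistent with the factors of 100s claimed above), an
algorithm needs q >= beta/gamma = 100 flops per word moved before arithmetic,
rather than data movement, dominates the running time.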
       
Now we can state a sequence of questions we will ask, and answer,
about n-by-n dense matrix multiplication C=A*B. (The same questions
will be asked of all the other algorithms of the course, more on that later).
         
(1) What is simplest (sequential) algorithm, and how much does it cost in f=#flops?   
    Assuming the data initially resides in a main memory, and that any data that 
    participates in arithmetic has to be moved to a smaller cache of size M, and that
    we want the final answer to end up back in main memory, how many words w does
    this simplest algorithm move between main memory and cache? Is this algorithm
    fast enough?
       Answer:  The usual algorithm consists of 3 nested loops and does 2n^3 flops:
          for i=1 to n, for j=1 to n, for k=1 to n, C(i,j) += A(i,k)*B(k,j);
       There are two cases: when all 3 matrices fit in cache, and when they don't.
       If the 3 matrices do fit in cache (3*n^2 <= M), then this algorithm will
       bring each entry A(i,k) and B(k,j) into cache the first time they are 
       needed, finish computing C without any more need to access memory, 
       and then write C back to memory, for a total of 2n^2 reads from and n^2 writes
       to memory, or w=3n^2 in all.  Clearly this is minimal. 
       But if the matrices do not all fit, i.e. 3n^2 is
       at least several times larger than M, then about n^3 words are moved.
       This algorithm is not fast, because the time to run,
        time = 2n^3*gamma + n^3*beta,
       is dominated by communication n^3*beta >>  2*n^3*gamma, even ignoring latency.
       Another way to say the same thing is to note that the computational
       intensity q = f/w = 2.
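
       For reference, here is the same triple loop written out as runnable
       Python (numpy is used only to hold the matrices and check the answer);
       this is a sketch for exposition, not a tuned implementation.

          import numpy as np

          def naive_matmul(A, B):
              """The usual 3 nested loops: 2*n^3 flops."""
              n = A.shape[0]
              C = np.zeros((n, n))
              for i in range(n):
                  for j in range(n):
                      for k in range(n):
                          C[i, j] += A[i, k] * B[k, j]
              return C

          n = 16
          A, B = np.random.rand(n, n), np.random.rand(n, n)
          assert np.allclose(naive_matmul(A, B), A @ B)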

(2) Is there an algorithm that moves fewer words? How many?
       Answer: There are (at least 2): "blocked" matrix multiply, and
       "recursive" matrix multiply, to be presented later:
       When n^2 is larger than M, either algorithm moves just w=O(n^3/sqrt(M)) words.
       Ex: An Intel Nehalem processor has an M = 32 KByte = 4K double-precision-word
       level-1 (L1) cache (the smallest and fastest of several caches).
       Then sqrt(M) = sqrt(4K) = 64, so the cost of communication (# words moved)
       is reduced by a factor of 64 by these algorithms. And these algorithms
       start paying off when your matrices are about 64 x 64 or larger.

(3) Can we do better?
      Answer: No: Theorem (Hong/Kung, 1981; Irony/Tiskin/Toledo, 2004) 
      #words_moved = Omega(n^3/sqrt(M)) is a lower bound for any algorithm 
      that does the usual 2n^3 flops to perform matmul.

(4) What if my computer has multiple levels of cache, as all real computers do? 
    (picture, office analogy)
      Answer: We can apply the same ideas recursively, and minimize the 
      #words_moved between all possible levels of memory
      Ex: The same Intel Nehalem has a level-2 (L2) cache of size M = 256KB
      = 32K double-precision-words, so sqrt(M) = 181, so it is possible to
      reduce #words_moved between main memory and L2 by factor of about 181x,
      when the matrices are about 181 x 181 or larger.

(5) Do these algorithms also minimize the number of messages?
     Answer: Not unless you use a different data structure to store your
     matrices in memory, storing them as blocks instead of the usual 
     rowwise format (as in C) or columnwise format (as in Fortran):
     you need to use a blocked data structure (or recursively blocked,
     if there are multiple levels of cache).
      If you use the right data structure, you can attain the lower bound
     #messages = Omega(n^3/M^(3/2))
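
     For concreteness, here is one way to copy a matrix from the usual
     rowwise layout into a blocked layout, sketched in Python/numpy (the
     function name and the use of numpy are my own choices for illustration):

        import numpy as np

        def to_blocked(A, b):
            """Copy an n x n rowwise matrix into block-major order.

            The result has shape (n/b, n/b, b, b): result[I, J] is the
            b x b block A[I*b:(I+1)*b, J*b:(J+1)*b], stored contiguously,
            so moving one block is one long message rather than b rows.
            """
            n = A.shape[0]
            assert n % b == 0, "sketch assumes b divides n"
            return np.ascontiguousarray(
                A.reshape(n // b, b, n // b, b).swapaxes(1, 2))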

(6) How fast can we hope that parallel matmul could run on P processors?
      Answer: Each processor should compute as little as possible (2n^3/P flops,
      if the 2n^3 flops are divided evenly) and communicate as little as possible: 
      Theorem:  #words_moved = Omega(n^2/sqrt(P)) and #messages = Omega(sqrt(P)) 
      are lower bounds on the communication costs.
    So time >= (2n^3/P)*gamma + c1*n^2/sqrt(P)*beta + c2*sqrt(P)*alpha
    for some constants c1 and c2.

(7) Is there a parallel Matmul algorithm that runs this fast?
      Answer: Yes: Cannon's algorithm, to be discussed later.

(8) Can we do matmul faster than 2n^3 flops? And keep it numerically stable?
    Are these algorithms worth using?
      Answer: Yes: Strassen's method takes O(n^(log2 7)) ~ O(n^2.81) flops
           The world's record (Coppersmith/Winograd) is O(n^2.376), but impractical
           None of these algorithms are quite as stable as the usual 2n^3 algorithm,
           but are good enough (or can be modified to be good enough).
           Strassen's method is the only one where the constant hidden in the
           O(.) is small enough to be practical.
           Alas, the rules for the Linpack Benchmark for ranking the
           500 fastest computers in the world (see www.top500.org), which
           measures how fast computers can do Gaussian Elimination on
           a large dense matrix, do not permit Strassen to be used, so
           relatively little effort goes into tuning Strassen's method
           to run as fast as possible, compared to the usual 2n^3 algorithm.

(9) Do we know how to minimize #words_moved for algorithms like Strassen?
      Answer: We think so (paper submitted, still many open problems)

(10) Is this all I need to know to write a fast matmul algorithm?
       Answer: No, lots of computer-specific details to get right
       (see graphs at www.cs.berkeley.edu/~volkov/cs267.sp09/hw1
       www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results
       for evidence). These are not a topic of this course.
       If you are interested, take CS267 next semester.

(11) So are there good implementations I can just use, and not
     worry about the details?
       Answer: Yes: Matrix multiply especially has been tuned for most
       computers by their vendors and is available in their math libraries.

We won't have time to answer all 11 questions in detail for every linear 
algebra problem, such as solving linear systems, least squares, eigenvalue
problems, the SVD, etc., but will try to sketch what is known:
The answers are all roughly the same (papers submitted and in progress), 
but there are some important differences and extensions:
  (1) The lower bound applies to all algorithms, for dense or sparse matrices,
      sequential or parallel. For sparse matrices, the numerator n^3 needs to
      be replaced by #flops actually performed, which is usually much smaller
      than n^3, so that
        #words_moved = Omega(#flops/sqrt(M)) and
        #messages    = Omega(#flops/M^(3/2))
  (2) There are algorithms for most problems that attain these communication
      lower bounds for dense matrices. A few of them, like matmul, do the same
      operations as the conventional algorithm but just in a different order,
      but most of them use mathematically different algorithms than their conventional
      counterparts. Some of these mathematically different algorithms do about the 
      same number of flops as their conventional counterparts, but others are quite
      different, and do several times as many flops (but are still O(n^3)).
      Finding algorithms for all linear algebra problems that minimize
      communication but only do a little more arithmetic is an open problem.
  (3) Computer vendors often have tuned dense linear algebra libraries in their
      math libraries, typically versions of LAPACK and sometimes ScaLAPACK.
      For example Intel has a special team for tuning its math libraries.
      However the algorithms that minimize communication are too recent to
      be in these libraries, and we have ongoing projects to incorporate them.
      Prototypes of the new algorithms are significantly faster in some cases.
      Implementing some of these could constitute class projects.
      Such projects could be oriented toward getting the fastest implementation,
      and others could just implement the algorithms in a language like Matlab,
      and explore their numerical properties without worrying about performance,
      since some may have interestingly different numerical properties than their
      conventional counterparts.
  (4) For sparse matrices, very few algorithms are known that attain the 
      communication lower bounds. There are many open questions.

Now we go back to the beginning of the list of questions, and consider

How to implement Matmul to minimize data movement:

   First look at naive implementation, what's wrong with it, then good one:

           ... multiply C = C + A*B
           for i = 1:n
              for j = 1:n
                 for k = 1:n
                   C(i,j) = C(i,j) + A(i,k)*B(k,j)

    Need to "decorate" code with explicit descriptions of data movement 
    (we assume we control it, for simplicity, even though hardware makes
    the decisions about moving data in real caches).  
    The answer will depend on size of fast memory, call it M.

    Easy case: A, B and C all fit in fast memory (i.e. 3*n^2 <= M),
       ... load all of A, B, C into fast memory
           multiply them using above algorithm
       ... store answers

       #words_moved = 4*n^2 (= lower bound = #inputs + #outputs)
            
    Interesting case: when matrices don't all fit (3*n^2 > M)
    Simplest way to do it: load data just before you need it

           for i = 1:n
              for j = 1:n
                 ... load C(i,j)
                 for k = 1:n
                   ... load A(i,k), B(k,j)
                   C(i,j) = C(i,j) + A(i,k)*B(k,j)
                 ... store C(i,j)

             #words_moved = 2n^3 + 2n^2 (2n^3 loads of A and B entries,
             plus n^2 loads and n^2 stores of C entries),
             so computational intensity = q = 2n^3/(2n^3 + 2n^2) ~ 1,
             not large enough to be fast (we need q ~ beta/gamma >> 1)

    A little better, but not much: assume row of A and of C, 
    both fit in fast memory (2*n <= M)

           for i = 1:n
              ... load whole rows A(i,:), C(i,:)
              for j = 1:n
                 for k = 1:n
                   ... load B(k,j)
                   C(i,j) = C(i,j) + A(i,k)*B(k,j)
              ... store row C(i,:)

            #words_moved = n^3 + 3*n^2, 
            computational intensity ~ 2, still not large enough to go fast

    Notation: think of A, B, C as collection of b x b subblocks,
    so that C[i,j] is b x b, numbered from C[1,1] to C[n/b,n/b] (picture)

        for i=1:n/b
           for j=1:n/b
              for k=1:n/b
                 C[i,j] = C[i,j] + A[i,k]*B[k,j]   ... b x b matrix multiplication

    Suppose that 3 bxb blocks fit in fast memory (3*b^2 <= M), then
    we can move data as follows:

        for i=1:n/b
           for j=1:n/b
              ... load block C[i,j]
              for k=1:n/b
                 ... load blocks A[i,k] and B[k,j]
                 C[i,j] = C[i,j] + A[i,k]*B[k,j]   ... b x b matrix multiplication
               ... store block C[i,j]

           #words_moved = 2*(n/b)^2 * b^2    ... for C
                          + 2*(n/b)^3 * b^2    ... for A and B
                        = 2*n^2 + 2*n^3/b   

    Much better: dominant term smaller by a factor of b
    computational intensity = q = 2*n^3/(2*n^2 + 2*n^3/b) ~ b

    We want b as large as possible to minimize #words_moved.
    How large can we make it? 3*b^2 <= M so
    #words_moved >= 2*n^2 + sqrt(3)*2*n^3/sqrt(M) = Omega(n^3/sqrt(M))
    computational intensity = q = O(sqrt(M))
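
     Here is the blocked algorithm above as runnable Python/numpy (a sketch;
     in practice the b x b multiply would itself be tuned code, and b would
     be chosen so that 3*b^2 <= M for the cache of interest):

        import numpy as np

        def blocked_matmul(A, B, b):
            """C = A*B by b x b blocks; assumes b divides n.

            In the model above, each C block is loaded and stored once per
            (i,j), and each A and B block loaded once per (i,j,k), for
            2*n^2 + 2*n^3/b words moved.
            """
            n = A.shape[0]
            assert n % b == 0
            C = np.zeros((n, n))
            for i in range(0, n, b):
                for j in range(0, n, b):
                    Cij = C[i:i+b, j:j+b].copy()        # load block C[i,j]
                    for k in range(0, n, b):
                        # load blocks A[i,k], B[k,j]; multiply in fast memory
                        Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
                    C[i:i+b, j:j+b] = Cij               # store block C[i,j]
            return C

        n, b = 8, 4
        A, B = np.random.rand(n, n), np.random.rand(n, n)
        assert np.allclose(blocked_matmul(A, B, b), A @ B)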

    Theorem: no matter how we reorganize matmul, as long as we
             do the usual 2*n^3 arithmetic operations, but possibly
             just doing the sums in different orders, then
             #words_moved >= Omega(n^3/sqrt(M))

    History: Hong, Kung, 1981
             Irony, Tiskin, Toledo, 2004 (parallel case too)
             Ballard, D., Holtz, Schwartz, 2009 (rest of linear algebra)

    Proof (ITT): Think of your matmul program as doing some sequence of
    instructions: add, mul, load, store, ...
    Arithmetic like add and mul only uses data in fast memory
    Load and store move data between fast and slow memory
    Now just look at all the load/store instructions, 
    say there are W (=#words_moved) of them. We want a lower bound on W.

      Let S(1) be the set of all instructions from the start up to
        the M-th load/store instruction
      Let S(2) be the set of all instructions after S(1), up to 
        the 2M-th load/store instruction
      ...
      Let S(i) be the set of all instructions after S(i-1), up to
        the i*M-th load/store instruction
      ...
      Let S(r) be the last set of instructions (which may have
        fewer than M load/store instructions, if M does not divide W evenly)
                 
    Now (r-1)*M <= W, so a lower bound on r gives a lower bound on W.
           
    Suppose we had an upper bound F on the number of adds & muls that
    could be in any S(t). Since we need to do 2*n^3 adds & muls,
    that tells us that r*F >= 2*n^3 or r >= 2*n^3/F, or
      W >= (r-1)*M >= (2*n^3/F -1)*M is a lower bound.

    How do we get an upper bound F? We need to ask what adds and muls
    we can possibly do in one set S(t). We will bound this by using
    the fact that during S(t), only a limited amount of data is 
    available in fast memory on which to do arithmetic:
      There are the M words in fast memory when S(t) starts,
         eg up to M different entries of A, B, partially computed entries of C
      We can do as many as M more loads during S(t) 
         eg up to M more entries of A, B, partially computed entries of C
      We can do as many as M stores during S(t)
         eg up to M more partially or fully computed entries of C
   Altogether, there are at most 3*M words of data on which to do adds&muls
   How many can we do?

   Clever idea of ITT: represent possible data by 3 faces of an nxnxn cube
       each A(i,k) represented by 1x1 square on one face
       each B(k,j) represented by 1x1 square on another, not parallel, face
       each C(i,j) represented by 1x1 square on another, not parallel, face
    and each multiply/add pair C(i,j) = C(i,j) + A(i,k)*B(k,j) by a 1x1x1 cube
       with corresponding projections on each face
   Counting problem: we know the total number of squares on each face that
       are "occupied" is at most 3*M. How many cubes can these "cover"?
   (Picture)

    Easy case: occupied C(i,j) fill up XxY rectangle
               occupied A(i,k) fill up XxZ rectangle
               occupied B(k,j) fill up ZxY rectangle
        "covered cubes" occupy brick of dimension XxYxZ, 
        so at most X*Y*Z of them
   General case: what if we don't have rectangles and bricks?
   Note that volume=XYZ can also be written
              XYZ = sqrt((X*Y)*(X*Z)*(Z*Y))
                  = sqrt((#C entries)*(#A entries)*(#B entries))
    Lemma (Loomis&Whitney, 1949)
         Let S be any bounded 3D set
          Let S_x be projection ("shadow") of S on the y,z - plane
            (i.e. if (x,y,z) in S then (y,z) in S_x)
         Similarly, let S_y and S_z be shadows on other planes
         Then Volume(S) <= sqrt(area(S_x)*area(S_y)*area(S_z))

    Apply to our case:
     S = set of 1x1x1 cubes representing all mul/add pairs
       = {cubes centered at (i,j,k) representing C(i,j) = C(i,j) + A(i,k)*B(k,j)}
     S_x = set of entries of B needed to perform these
         = {squares centered at (k,j) projected from C(i,j) = C(i,j) + A(i,k)*B(k,j)}
     S_y = set of entries of A needed to perform them
         = {squares centered at (i,k) projected from C(i,j) = C(i,j) + A(i,k)*B(k,j)}
     S_z = set of entries of C needed to perform them
         = {squares centered at (i,j) projected from C(i,j) = C(i,j) + A(i,k)*B(k,j)}

    Lemma => volume(S) = #cubes in S = #flops/2
                       <= sqrt(area(S_x)*area(S_y)*area(S_z))
                       =  sqrt((#C entries)*(#A entries)*(#B entries))
                       <= sqrt((3*M)*(3*M)*(3*M))
    or #flops = F <= 6*sqrt(3)*M^(1.5)

    Finally
      #words_moved = W 
                  >= (r-1)*M 
                  >= (2*n^3/F -1)*M 
                  >= (2*n^3/(6*sqrt(3)*M^(1.5)) -1)*M 
                  = (1/sqrt(27))*n^3/sqrt(M) - M
                  = Omega(n^3/sqrt(M))  as desired

     This completes the proof of the lower bound.

    Here is another algorithm that also attains the lower bound,
    that will use a pattern that will appear later: recursion

    function C = RMM(A,B)  ... recursive matrix multiplication
       ... assume for simplicity that A and B are square of size n
       ... assume for simplicity that n is a power of 2
       if n=1 
          C = A*B  ... scalar multiplication
       else
          ... write A = [ A11  A12 ], where each Aij n/2 by n/2
                        [ A21  A22 ]
          ... write B and C similarly
          C11 = RMM(A11,B11) + RMM(A12,B21)
          C12 = RMM(A11,B12) + RMM(A12,B22)
          C21 = RMM(A21,B11) + RMM(A22,B21)
          C22 = RMM(A21,B12) + RMM(A22,B22)
       endif
       return

    Correctness follows by induction on n
     Cost:  let A(n) = #arithmetic operations on matrices of dimension n
    A(n) = 8*A(n/2)   ... 8 recursive calls
           + n^2      ... 4 additions of n/2 x n/2 matrices
    and A(1) = 1 is the base case

    Solve by changing variables from n=2^m to m:
      a(m) = 8*a(m-1) + 2^(2m),    a(0) = 1
    Divide by 8^m to get 
      a(m)/8^m = a(m-1)/8^(m-1) + (1/2)^m
    Change variables again to b(m) = a(m)/8^m
      b(m) = b(m-1) + (1/2)^m,   b(0) = a(0)/8^0 = 1
            = sum_{k=1 to m} (1/2)^k + b(0)
    a simple geometric sum
      b(m) = 2 - (1/2)^m  -> 2
    or
      A(n) = a(log2 n) = b(log2 n)*8^(log2 n) 
                       = b(log2 n)*n^3
                       -> 2n^3 as expected
    Ask: Why isn't it exactly 2n^3?
         A(n) = b(log2 n)*n^3
              = (2 - 1/n)*n^3
              = 2n^3 - n^2
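
    Here is RMM as runnable Python/numpy, instrumented (as a sketch) to count
    arithmetic operations so the 2n^3 - n^2 total above can be checked; the
    flop counter is my addition for illustration.

       import numpy as np

       def RMM(A, B):
           """Recursive matrix multiply; n a power of 2. Returns (C, flops)."""
           n = A.shape[0]
           if n == 1:
               return A * B, 1                       # one scalar multiplication
           h = n // 2
           A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
           B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
           pairs = [(A11, B11, A12, B21), (A11, B12, A12, B22),
                    (A21, B11, A22, B21), (A21, B12, A22, B22)]
           blocks, flops = [], 0
           for (X1, Y1, X2, Y2) in pairs:            # 8 recursive calls in all
               P, f1 = RMM(X1, Y1)
               Q, f2 = RMM(X2, Y2)
               blocks.append(P + Q)                  # one addition of h x h blocks
               flops += f1 + f2 + h * h
           C = np.block([[blocks[0], blocks[1]], [blocks[2], blocks[3]]])
           return C, flops

       n = 8
       A, B = np.random.rand(n, n), np.random.rand(n, n)
       C, flops = RMM(A, B)
       assert np.allclose(C, A @ B) and flops == 2*n**3 - n**2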

   Let W(n) = #words moved between slow memory and cache of size M
       W(n) = 8*W(n/2)   ... 8 recursive calls
              + 12(n/2)^2  ... up to 8 reads and 4 writes of n/2 x n/2 blocks
            = 8*W(n/2) + 3n^2
       oops - looks bigger than A(n), rather than A(n)/sqrt(M)
       But what is base case? 
           W(b) = 3*b^2 if 3*b^2 <= M,  i.e. b = sqrt(M/3)
       The base case is not W(1) = something
       Solve as before: change variables from n=2^m to m:
           w(m) = 8*w(m-1) + 3*2^(2m),   w(log2 sqrt(M/3)) = M
       Let mbar = log2 sqrt(M/3)
       Divide by 8^m to get v(m) = w(m)/8^m
           v(m) = v(m-1) + 3*(1/2)^m,    
               v(mbar) = M/8^mbar = M/(M/3)^(3/2) = 3^(3/2)/M^(1/2)
       This is again a geometric sum but just down to m = mbar, not 1
           v(m) = sum_{k=mbar+1 to m} 3*(1/2)^k + v(mbar)
                <= 3* (1/2)^(mbar) + v(mbar)
                 = 3/sqrt(M/3) + 3^(3/2)/sqrt(M)
                 = 2*3^(3/2)/sqrt(M)
       so 
           W(n) = w(log2 n) = 8^(log2 n)*v(log2 n) <= 2*3^(3/2)*n^3/sqrt(M)
       as desired.

   Here is another application of the theorem to the parallel case,
   where "load" means receive data from another processor's memory,
   and "store" means send data to another processor's memory.

   Suppose we have P processors working in parallel to multiply 2 matrices.
   We would like each processor to do 1/P-th of the arithmetic,
   i.e. 2n^3/P flops. We want each processor to store at most 1/P-th
   of all the data (A, B and C), i.e. M = 3*n^2/P. Then the same proof
   of our theorem says the number of load/stores is at least
   Omega((2*n^3/P) / sqrt(3*n^2/P)) = Omega(n^2/sqrt(P))

   Ask: Why does the proof work, even if a processor is only doing
   part of the work in matmul?

   Is there an algorithm that attains this? yes: Cannon's Algorithm 
   (picture of layout and processor grid)
   (picture of algorithm)
      assume s = sqrt(P) and n/s are integers
       let each A(i,j) be an n/s x n/s subblock, numbered from A(0,0) to A(s-1,s-1)
(1)    for all i=0 to s-1, left-circular-shift row i of A by i (parallel)
          ... so A(i,j) overwritten by A(i, i+j mod s)
(2)    for all j=0 to s-1, up-circular-shift column j of B by j (parallel)
          ... so B(i,j) overwritten by B(i+j mod s, j)
       for k= 0 to s-1 (sequential)
          for all i=0 to s-1 and all j=0 to s-1  (all processors in parallel)
(3)          C(i,j) = C(i,j) + A(i,j)*B(i,j)
(4)          left-circular-shift each row of A by 1
(5)          up-circular-shift each column of B by 1
             ... C(i,j) = C(i,j) + A(i, i+j+k mod s)*B(i+j+k mod s, j)
   (Show slides)
   Proof of correctness:
       inner loop adds all C(i,j) = C(i,j) + A(i,k)*B(k,j), 
         just in "rotated" order
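
    To make the shifting pattern concrete, here is a serial simulation of
    Cannon's algorithm in Python/numpy (a sketch: re-indexing the block
    arrays plays the role of the circular shifts, which on a real machine
    would be messages between processors):

       import numpy as np

       def cannon(A, B, s):
           """Simulate Cannon's algorithm on an s x s grid; assumes s | n."""
           n = A.shape[0]; b = n // s
           # Xb[i][j] = block owned by processor (i,j)
           Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(s)] for i in range(s)]
           Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(s)] for i in range(s)]
           Cb = [[np.zeros((b, b)) for j in range(s)] for i in range(s)]
           # (1),(2) skew: A(i,j) <- A(i, i+j mod s), B(i,j) <- B(i+j mod s, j)
           Ab = [[Ab[i][(i + j) % s] for j in range(s)] for i in range(s)]
           Bb = [[Bb[(i + j) % s][j] for j in range(s)] for i in range(s)]
           for step in range(s):                     # s sequential steps
               for i in range(s):
                   for j in range(s):
                       Cb[i][j] += Ab[i][j] @ Bb[i][j]                          # (3)
               Ab = [[Ab[i][(j + 1) % s] for j in range(s)] for i in range(s)]  # (4)
               Bb = [[Bb[(i + 1) % s][j] for j in range(s)] for i in range(s)]  # (5)
           return np.block(Cb)

       n, s = 6, 3
       A, B = np.random.rand(n, n), np.random.rand(n, n)
       assert np.allclose(cannon(A, B, s), A @ B)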
     
   Arithmetic Cost: from line (3): s*(2*(n/s)^3) = 2*n^3/s^2 = 2*n^3/P,
       so each processor does 1/P-th of the work as desired
   Communication Cost (#words moved)
       from line (1): s*(n/s)^2 = n^2/s words moved
       from line (2): same
       from line (4): s*(n/s)^2 = n^2/s words moved
       from line (5): same
       total = 4*n^2/s = 4*n^2/sqrt(P),  meeting the lower bound

    Let's consider a more complicated timing model:
       time = #flops*gamma + #words_moved*beta + #messages*alpha
       where one message is a submatrix:
       from each of lines (1), (2), (4), (5): s messages, 
       so 4*s = 4*sqrt(P) altogether

   Since #words_moved <= #messages * maximum_message_size
   a lower bound on #messages is
         #messages >= #words_moved / maximum_message_size
   Here we assume that each processor only stores O(1/P-th) of all data,
   i.e. memory size is O(n^2/P). This is clearly an upper bound on
   the maximum_message_size too, so
        #messages >= #words_moved / O(n^2/P)
                    = Omega(n^2/sqrt(P)) / O(n^2/P) = Omega(sqrt(P))
   and we see that Cannon's Alg also minimizes #messages  (up to a constant factor)

   Note: lots of assumptions (P a perfect square, s|n, 
         starting data in right place, double memory)
         so practical algorithm (SUMMA) more complicated (in PBLAS library!)
           sends log(P) times as much data as Cannon

    One more matmul topic: we can go asymptotically faster than O(n^3) flops

   Strassen (1967): O(n^2.81) = O(n^(log2 7))

   Recursive, based on remarkable identities: for multiplying 2 x 2 matrices
     Write A = [ A11 A12 ], each block of size n/2 x n/2, same for B and C
               [ A21 A22 ]
     P1 = (A12-A22)*(B21+B22)
     P2 = (A11+A22)*(B11+B22)
     P3 = (A11-A21)*(B11+B12)
     P4 = (A11+A12)*B22
     P5 = A11*(B12-B22)
     P6 = A22*(B21-B11)
     P7 = (A21+A22)*B11
     C11 = P1+P2-P4+P6
     C12 = P4+P5
     C21 = P6+P7
     C22 = P2-P3+P5-P7

   Changes naturally into recursive algorithm:
      function C = Strassen(A,B)
      ... assume for simplicity that A and B are square of size n
      ... assume for simplicity that n is a power of 2
      if n=1 
         C = A*B  ... scalar multiplication
      else
         P1 = Strassen(A12-A22,B21+B22) ... and 6 more similar lines
         C11 = P1+P2-P4+P6 ... and 3 more similar lines
      end
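
    Here is the recursion as runnable Python/numpy (a sketch); the cutoff
    below which we fall back to the ordinary algorithm is my own choice,
    added because practical Strassen codes stop recursing on small blocks:

       import numpy as np

       def strassen(A, B, cutoff=32):
           """Strassen's method; assumes n is a power of 2 (sketch)."""
           n = A.shape[0]
           if n <= cutoff:
               return A @ B                   # ordinary matmul on small blocks
           h = n // 2
           A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
           B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
           P1 = strassen(A12 - A22, B21 + B22, cutoff)
           P2 = strassen(A11 + A22, B11 + B22, cutoff)
           P3 = strassen(A11 - A21, B11 + B12, cutoff)
           P4 = strassen(A11 + A12, B22, cutoff)
           P5 = strassen(A11, B12 - B22, cutoff)
           P6 = strassen(A22, B21 - B11, cutoff)
           P7 = strassen(A21 + A22, B11, cutoff)
           return np.block([[P1 + P2 - P4 + P6, P4 + P5],
                            [P6 + P7, P2 - P3 + P5 - P7]])

       n = 128
       A, B = np.random.rand(n, n), np.random.rand(n, n)
       assert np.allclose(strassen(A, B), A @ B)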

   A(n) = #arithmetic operations
        = 7*A(n/2)  ... 7 recursive calls
          + 18*(n/2)^2  ... 18 additions of n/2 x n/2 matrices

   solve as before: change variables from n=2^m to m, A(n) to a(m):
      a(m) = 7*a(m-1) + (9/2)*2^(2m),    a(0)=A(1)=1
   divide by 7^m, change again to b(m) = a(m)/7^m
      b(m) = b(m-1) + (9/2)*(4/7)^m,   b(0)=a(0)=1
           = O(1)  ... a convergent geometric sequence
   so
      A(n) = a(log2 n) = b(log2 n)*7^(log2 n) = O(7^(log2 n)) = O(n^(log2 7))

   What about #words moved? Again get a similar recurrence:
      W(n) = 7*W(n/2) + O(n^2), and the base case W(nbar) = O(nbar^2) when
         nbar^2 = O(M), i.e. the whole problem fits in cache.
      Solution is O(n^x/M^(x/2 -1)) where x = log2(7).
      Note that plugging in x=3 gives the lower bound for standard matmul
   Thm (D., Ballard, Holtz, Schwartz, 2010): Omega(n^x/M^(x/2 - 1)) is a
      lower bound on #words_moved for Strassen and algorithms "enough like" Strassen
      (paper in review, lots of open questions).

   Strassen is not (yet) much used in practice
      can pay off for n not too large (a few hundred)
      error analysis slightly worse than usual algorithm:
           Usual Matmul (Q1.10):  |fl(A*B) - (A*B)| <= n*eps*|A|*|B|
           Strassen   ||fl(A*B) - (A*B)|| <= O(eps)*||A||*||B||
      which is good enough for lots of purposes, but when can
      Strassen's be much worse? (suppose one row of A, or one column of B, very tiny)
      Not allowed to be used in "LINPACK Benchmark" (www.top500.org)
      which ranks machines by speed = (2/3)*n^3 / time_for_GEPP(n)
      and if you use Strassen (we'll see how) in GEPP, it does << (2/3)n^3 flops
       so the computed "speed" could exceed the actual peak machine speed
      As a result, vendors stopped optimizing Strassen, even though it's a good idea!

   Current world's record
      Coppersmith & Winograd (1990): O(n^2.376)
         impractical: would need huge n to pay off
      Rederived by Cohn, Umans, Kleinberg, Szegedy (2003,2005)
         using group representation theory
      All such algorithms shown numerically stable (in norm-sense)
         by D., Dumitriu, Holtz, Kleinberg (2007)
      Given any fast matmul running in O(n^w) operations, it is possible
        to do the rest of linear algebra (solve Ax=b, least squares, eigenproblems,
       SVD) in O(n^(w+eta)) for any eta>0, and to do so numerically stably,
       by D., Dumitriu, Holtz (2007)

   A similar divide and conquer idea can be used to make complex matrix
   multiplication a constant factor cheaper than you think. Normally, to multiply
   two complex numbers or matrices, you use the formula
    (A+i*B)*(C+i*D) = (A*C-B*D)+i*(A*D+B*C)
   costing 4 real multiplications and 2 real additions. Here is another way:
    T1 = A*C
    T2 = B*D
    T3 = (A+B)*(C+D)
    Then it turns out that
        (A+i*B)*(C+i*D) = (T1-T2) + i*(T3-T1-T2)
    for a cost of 3 real multiplications and 5 real additions.
    But since matrix multiplication costs O(n^3) or O(n^x) for some x>2,
    and addition costs O(n^2), for large n the cost drops by a factor 3/4.
    The error analysis is very similar to the usual algorithm.
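
     Here is the 3-multiplication formula as a Python/numpy sketch, checked
     against the direct product (the function name is mine, for illustration):

        import numpy as np

        def complex_matmul_3M(A, B, C, D):
            """(A + i*B)*(C + i*D) using 3 real matrix multiplications."""
            T1 = A @ C
            T2 = B @ D
            T3 = (A + B) @ (C + D)
            return (T1 - T2) + 1j * (T3 - T1 - T2)

        n = 64
        A, B, C, D = (np.random.rand(n, n) for _ in range(4))
        assert np.allclose(complex_matmul_3M(A, B, C, D), (A + 1j*B) @ (C + 1j*D))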

    Applying the above formula recursively in a different context
    yields an algorithm for multiplying two n-bit integers in 
    O(n^log2(3)) = O(n^1.59) bit operations, as opposed to the conventional O(n^2).
    This topic is taught in classes like CS170.
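
     For completeness, here is that integer algorithm (Karatsuba's) as a short
     Python sketch; it splits each number into high and low halves (by decimal
     digits here, rather than bits, but the idea is the same) and reuses the
     same 3-multiplication trick:

        def karatsuba(x, y):
            """Multiply nonnegative integers with O(n^1.59) digit operations."""
            if x < 10 or y < 10:
                return x * y
            m = max(len(str(x)), len(str(y))) // 2
            hi_x, lo_x = divmod(x, 10**m)
            hi_y, lo_y = divmod(y, 10**m)
            t1 = karatsuba(hi_x, hi_y)
            t2 = karatsuba(lo_x, lo_y)
            t3 = karatsuba(hi_x + lo_x, hi_y + lo_y)
            return t1 * 10**(2*m) + (t3 - t1 - t2) * 10**m + t2

        assert karatsuba(12345678, 87654321) == 12345678 * 87654321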
      
   Final issue, applies to parallel case of all algorithms we discuss, again we
   illustrate just for matmul:
   If you are sitting at your machine, and run your program to multiply matrices 
   twice, with the identical inputs, does it necessarily give identical outputs?
   How might it not? Why should anyone care? 
    Possible class project: experiment with algorithms that are guaranteed to
    always get the same answer, and compare their performance and accuracy with
    implementations that do not (i.e. are only optimized for speed).