Notes for Math 221, Lecture 7, Sep 15 2014

Goal: Understand the real cost (in time) of running an algorithm, so we can design algorithms that run as fast as possible. Traditionally we just count the number of arithmetic operations, but in fact multiplication and addition are the cheapest operations an algorithm performs: it can be orders of magnitude more expensive to get the data from wherever it is stored in the computer (say main memory) and bring it to the part of the hardware that actually does the arithmetic. (Draw pictures of the simplest models of sequential and parallel computers.) For example, on a sequential computer the main cost of a naive algorithm would be moving data between main memory (DRAM) and on-chip cache, and on a parallel computer it is moving data between processors connected over a network. As a shorthand, we call all forms of moving data "communication", and seek to minimize it if possible.

Sometimes the difference between a naive, slow algorithm that communicates a lot and a clever, fast algorithm that communicates much less is "just" the order in which operations are performed, and perhaps a different data structure to store the matrices, but otherwise the algorithm is mathematically identical (perhaps with small changes in roundoff errors). If this were always the case, then this topic might be more suited to a computer science course. But in fact for many problems a different mathematical algorithm is required to communicate less, which is why it is a topic in this course. To illustrate the ideas involved, we will use the simplest algorithm, matrix multiplication, even though the algorithm that communicates least (indeed, it attains a lower bound, and so is optimal) really does just do the standard operations in a different order. Later we will sketch more complicated algorithms (e.g. for linear systems and least squares) that are mathematically different. Many of these algorithms have only been developed in the last few years, and shown to attain lower bounds on communication. They are not yet available in standard libraries like LAPACK (used in Matlab) and ScaLAPACK (the parallel version), but we are working on that, and can suggest various related class projects.

We want a simple mathematical model of what it costs to move data, so we can analyze algorithms easily and identify the one that is fastest.
So we need to define two terms: bandwidth and latency.

First, to establish intuition, consider the time it takes to move cars on the freeway from Berkeley to Sacramento:
   Bandwidth measures how many cars/hour can get from Berkeley to Sacramento:
      #cars/hour = density (#cars/mile in a lane) x velocity (miles/hour) x #lanes
   Latency measures how long it takes for one car to get from Berkeley to Sacramento:
      time (hours) = distance (miles) / velocity (miles/hour)
So the minimum time it takes n cars to go from Berkeley to Sacramento is attained when they all travel in a single "convoy", that is, all as close together as possible given the density, and using all lanes:
   time (hours) = time for first car to arrive + time for remaining cars to arrive
                = latency + n/bandwidth

Now consider the similar situation of reading bits of data from disk, in which the read head must move toward or away from the center of the disk to line up with the circle on which the data lies, then wait for the disk to spin around so the data is underneath the read head, and then read the data:
   Bandwidth measures how many bits/second can be read:
      #bits/second = density (#bits/cm) x velocity (cm/sec) x #read_heads
   Latency measures the (average) time to read one bit:
      time (sec) = average distance to move read head (cm) / velocity (cm/sec)
                 + average distance to spin disk (.5 revolution) / velocity (revolutions/sec)
So the minimum time to read n bits from disk is attained when they are all packed together on one circle, and is (on average)
   latency + n/bandwidth

The same idea (with physics that is harder to explain) applies to reading bits from main memory to cache (draw picture): the data resides in main memory initially, but one can only perform arithmetic and other operations on it after moving it to the much smaller cache. The time to read or write w words of data stored in contiguous memory locations (which we call a "message" instead of a convoy) is
   time = latency + w/bandwidth
More generally, to read or write w words stored in m separate contiguous messages costs
   m*latency + w/bandwidth

Notation: We will write this cost as m*alpha + w*beta, and refer to it as the cost of "communication". We refer to w as #words_moved and m as #messages. We also let gamma = time per flop, so if an algorithm does f flops, our estimate of the total running time will be
   time = m*alpha + w*beta + f*gamma

We claim that
   (1) gamma << beta << alpha on modern machines (factors of 100s for memory, even bigger for disk or for sending messages between processors), and
   (2) it is only getting worse over time: gamma is improving ~60%/year, beta for memory about 23%/year, alpha for memory about 7%/year.
So when we design or choose algorithms, we need to pay attention to their communication costs.

How do we tell if we have designed the algorithm well with respect to not communicating too much? If the f*gamma term in the time dominates the communication cost m*alpha + w*beta (i.e. f*gamma >= m*alpha + w*beta), then the algorithm is at most 2x slower than if communication were free. When f*gamma >> m*alpha + w*beta, the algorithm is running at near peak arithmetic speed, and can't go faster. When f*gamma << m*alpha + w*beta, communication costs dominate.

Notation: the computational intensity is q = f/w = "flops per word moved". This needs to be large to go fast, since f*gamma > w*beta means q = f/w > beta/gamma >> 1.

We will know we have done as well as possible if we can show that w and m are close to known lower bounds, which we will describe below. The sketch below makes this cost model concrete.
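Here is a minimal Python sketch of the cost model (not from the original notes). The constants alpha, beta and gamma below are invented for illustration only, not measurements of any particular machine; the point is just how the three terms and the computational intensity q are computed.

   # A minimal sketch of the cost model time = m*alpha + w*beta + f*gamma.
   # The hardware constants below are made up for illustration.
   alpha = 1e-6   # seconds per message (latency)
   beta  = 1e-9   # seconds per word moved (inverse bandwidth)
   gamma = 1e-11  # seconds per flop

   def model_time(f, w, m):
       """Estimated running time for f flops, w words moved, m messages."""
       return m * alpha + w * beta + f * gamma

   # Example: a dot product of two length-n vectors:
   # 2n flops, about 2n words moved, say 2 messages.
   n = 10**6
   f, w, m = 2*n, 2*n, 2
   q = f / w   # computational intensity: flops per word moved
   print("estimated time:", model_time(f, w, m), "seconds")
   print("q =", q, "; to be compute-bound we need q >> beta/gamma =", beta/gamma)

With these illustrative constants the w*beta term dominates the f*gamma term by a factor of about 100, which is the situation the rest of the lecture is about.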
But first, to describe how this trend has impacted algorithms over time, we give a little history.

In the beginning was the do-loop. This was enough for the first libraries (EISPACK, for eigenvalues of dense matrices, in the mid 1960s). People didn't worry about data motion; arithmetic was more expensive, and they mostly just tried to get the answer reliably, in O(n^3) flops.

BLAS-1 library (mid 1970s) - Basic Linear Algebra Subroutines
This was a standard library of 15 operations (mostly) on vectors, including:
   (1) y = alpha*x + y, x and y vectors, alpha scalar ("AXPY" for short)
       Ex: the innermost loop in Gaussian Elimination (GE) is
           for k = i+1:n, A(j,k) = A(j,k) - A(j,i)*A(i,k)
   (2) dot product
   (3) 2-norm(x) = sqrt( sum_i x(i)^2 )
   (4) finding the largest entry in absolute value in a vector (for pivoting in GE)
The motivations for the BLAS-1 at the time were:
   easing programming and improving readability, since these were commonly used functions
   robustness (e.g. avoiding over/underflow in 2-norm(x))
   portability and efficiency (if optimized on each architecture)
But there is no way to minimize communication in such simple operations, because they do only a few flops per memory reference, e.g. 2n flops on 2n data for a dot product, for a computational intensity of q = 2n/(2n) = 1. So if we implement matmul or GE or anything else by loops calling AXPY, it will communicate a lot - about as much as the number of flops.

BLAS-2 library (mid 1980s)
This was a standard library of 25 operations (mostly) on matrix-vector pairs:
   (1) y = beta*y + alpha*A*x ("GEMV" for short), with lots of variations for different matrix
       structures (symmetric, banded, triangular, etc.); it allows A^T or A^* to be used instead
       of A, and does the obvious optimizations when alpha and/or beta is +1, -1 or 0
   (2) A = A + alpha*x*y^T (matrix + rank-1 update) ("GER" for short)
       Ex: It turns out the 2 innermost loops in GE can be written with GER:
           A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i)*A(i,i+1:n)
   (3) Solve T*x = b where T is triangular (used by GE to solve A*x=b) ("TRSV")
The motivation was similar to that of the BLAS-1, plus more opportunities to optimize on the vector computers of the day. But there is still not much reduction in communication; for example, GEMV reads n^2+2*n+2 words, writes n words, and does 2*n^2+3*n flops, for a computational intensity of about q = 2.

BLAS-3 library (late 1980s)
This was a standard library of 9 operations on matrix-matrix pairs:
   (1) C = beta*C + alpha*A*B ("GEMM" for short)
   (2) C = beta*C + alpha*A*A^T, C symmetric ("SYRK" for short)
   (3) Solve T*X = B where T is triangular and X and B are matrices ("TRSM" for short)
Finally these have the potential for significant communication optimization, since for example the computational intensity q of GEMM applied to n x n matrices is (2n^3 flops)/(3n^2 inputs + n^2 outputs) = n/2. But as we'll see, the straightforward 3-nested-loop version of GEMM is no better than BLAS-2 or BLAS-1, so a different implementation is required, with very large speedups possible. In fact, there is an optimal way to implement GEMM that provably does as little communication as possible (under certain assumptions).

Note: BLAS-k does O(n^k) operations, for k = 1, 2, 3, making the names easier to remember. These are supplied as part of the optimized math libraries on essentially all computers: use them as your building blocks! Reference implementations and documentation are at www.netlib.org/blas. The sketch below shows the same GE update phrased at different BLAS levels.
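As a small illustration (a sketch, not from the original notes, and assuming numpy is available; numpy in turn calls whatever optimized BLAS it was built against), the GE update can be written at BLAS-1 level as a loop of AXPYs, or at BLAS-2 level as a single rank-1 (GER-like) update, and a BLAS-3-style matrix multiply is what communication-avoiding algorithms are built from.

   import numpy as np

   n = 6
   A = np.random.rand(n, n)
   i = 2   # pivot column index, chosen arbitrarily for illustration

   # BLAS-1 style: one AXPY per row j > i (many small operations, q ~ 1)
   A1 = A.copy()
   for j in range(i+1, n):
       A1[j, i+1:] -= A1[j, i] * A1[i, i+1:]      # y = y - alpha*x

   # BLAS-2 style: one rank-1 update (GER-like), still only q ~ 2
   A2 = A.copy()
   A2[i+1:, i+1:] -= np.outer(A2[i+1:, i], A2[i, i+1:])

   print(np.allclose(A1, A2))    # True: same update, different BLAS level

   # BLAS-3 style: matrix-matrix multiply (GEMM-like), q ~ n/2,
   # the kind of operation worth organizing algorithms around.
   B, C = np.random.rand(n, n), np.random.rand(n, n)
   D = C + B @ A                 # C = C + B*A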
This led the community to seek algorithms for the rest of linear algebra (solving Ax=b, least squares, computing eigenvalues, the SVD, etc.) that did most of their work by calling BLAS-3 routines, which would then run at high speed. This led to the LAPACK library (and its parallel version, ScaLAPACK), which is the basis of most dense linear algebra libraries provided by computer vendors, and is used in many packages like Matlab.

But more recently it was discovered that the communication lower bounds attained by GEMM also apply to essentially all the rest of dense linear algebra - basically anything that sufficiently resembles the 3 nested loops of GEMM. Next it was realized that the algorithms in LAPACK and ScaLAPACK usually did not attain these lower bounds; in fact the algorithms there could do asymptotically more communication than the bounds require. This in turn set off a search for new (or previously invented but ignored) algorithms that attain these lower bounds. This has led to a number of new algorithms, some with large speedups, and some still of just theoretical interest; the constants hidden in an O(.) analysis are important in practice! We will talk about GEMM in more detail, and sketch results for other algorithms.

The following theorem for matrix multiplication was first proved in 1981 by Hong and Kung. The proof below is based on one by Irony, Tiskin and Toledo in 2004, which extends both to parallel algorithms and to other 3-nested-loop-like algorithms:

Thm: Suppose one wants to multiply two n x n matrices C = A*B by doing the usual 2*n^3 multiplies and adds, on a computer with a main memory large enough to hold A, B and C, and a smaller cache of size M. Arithmetic can only be done on data stored in cache. Then a lower bound on the number of words W that need to move back and forth between main memory and cache to run the algorithm is Omega(n^3/sqrt(M)).

More generally, one does not need to multiply dense square matrices: the same proof works for rectangular and/or sparse matrices, and the only change is that Omega(n^3/sqrt(M)) becomes Omega(#flops/sqrt(M)).

Proof sketch: Suppose we fill up the cache with M words of data, do as many flops as possible, store the results back in main memory, and repeat until we are done. Suppose we could upper bound by G the number of flops that are possible given M words of data. Then each such phase does at most G flops and moves 2M words back and forth between main memory and cache. Since we have to do 2n^3 flops altogether, we would need to repeat this at least 2n^3/G times, for a total cost of moving at least (2n^3/G)*2M words. So we need to find an upper bound G.

We turn this into a geometry problem as follows. We represent each pair of flops (one inner-loop iteration) C(i,j) = C(i,j) + A(i,k)*B(k,j) as a lattice point (i,j,k) in 3D space, with 1 <= i,j,k <= n, so an n x n x n cube of points. The flops we can do with M words of data are represented by some subset V of this cube of lattice points. What data is required to execute the flops represented by (i,j,k)? It is C(i,j), A(i,k) and B(k,j), which are represented by the 3 lattice points (i,j), (i,k) and (k,j) on 3 faces of the cube, i.e. the projections of (i,j,k) onto these three faces (draw picture). We know we can have at most M words of data, i.e. at most M projected points on each face.
We now use a classical theorem of geometry by Loomis and Whitney (1949), which says that if V is the set of lattice points whose cardinality |V| we want to bound, and its 3 projections onto the faces of the cube are
   V_C (representing entries C(i,j)),
   V_A (representing entries A(i,k)),
   V_B (representing entries B(k,j)),
then
   |V| <= sqrt( |V_A| * |V_B| * |V_C| ).
Since we can have at most M entries of A, B and C in cache, this means
   |V| <= sqrt( M * M * M ) = M^(3/2),
yielding our desired upper bound G = M^(3/2). Finally, this yields our lower bound
   W >= (2*n^3)/M^(3/2) * 2M = Omega(n^3/sqrt(M))
as claimed. We have not tried to be careful with the constant factors, but just tried to give the main idea.

The proof also gives us a big hint as to how to design an optimal algorithm: what is the shape of the V that attains its upper bound M^(3/2)? Clearly V must be a cube of side length sqrt(M). This means that we will want to break A, B and C up into square submatrices of dimension at most sqrt(M/3) (the factor of 3 means we can fit 3 such submatrices in cache simultaneously), read 3 submatrices into cache, multiply them, and write the results back to main memory:

   ... Multiply n x n matrices C = C + A*B
   ... Let b be the block size, small enough that 3*b^2 <= M
   ... Express C as a block matrix where C[i,j] is a b x b block (picture)
   ... Ditto for A[i,k] and B[k,j]
   for i = 1 to n/b   ... number of b x b blocks in each dimension
     for j = 1 to n/b
        read C[i,j] into cache                  ... b^2 words moved
        for k = 1 to n/b
           read A[i,k] and B[k,j] into cache    ... 2*b^2 words moved
           ... all 3 b x b submatrices fit in cache since 3*b^2 <= M
           C[i,j] = C[i,j] + A[i,k]*B[k,j]      ... b x b matrix multiplication, so 3 more
                                                ... nested loops; everything is in cache,
                                                ... so no communication
        end for
        write C[i,j] back to main memory        ... b^2 words moved
     end for
   end for

Counting the total number of words moved between cache and memory, we get
   (n/b)^2*b^2   = n^2       for reading C[i,j]
   (n/b)^3*2*b^2 = 2*n^3/b   for reading A[i,k] and B[k,j]
   (n/b)^2*b^2   = n^2       for writing C[i,j]
for a total of 2*n^3/b + 2*n^2 words moved. We minimize this by making b as large as possible, which is limited by 3*b^2 <= M, i.e. b = sqrt(M/3), so the number of words moved is O(n^3/sqrt(M)), which attains the lower bound.

Here is another optimal algorithm, which we will find simpler to generalize to other linear algebra problems, because it is "cache oblivious", i.e. it works for any cache size M without needing to know M:

   function C = RMM(A,B)   ... Recursive Matrix Multiplication
   ... assume for simplicity that A and B are square of size n
   ... assume for simplicity that n is a power of 2
   if n = 1
      C = A*B   ... scalar multiplication
   else
      ... write A = [ A11 A12 ], where each Aij is n/2 by n/2
      ...           [ A21 A22 ]
      ... write B and C similarly
      C11 = RMM(A11,B11) + RMM(A12,B21)
      C12 = RMM(A11,B12) + RMM(A12,B22)
      C21 = RMM(A21,B11) + RMM(A22,B21)
      C22 = RMM(A21,B12) + RMM(A22,B22)
   end if
   return

Correctness follows by induction on n. Cost: let A(n) = #arithmetic operations on matrices of dimension n; then
   A(n) = 8*A(n/2)   ... 8 recursive calls
          + n^2      ... 4 additions of n/2 x n/2 matrices
and A(1) = 1 is the base case. We claim that A(n) = 2n^3 - n^2, the same as the classical algorithm. Solve by changing variables from n = 2^m to m:
   a(m) = 8*a(m-1) + 2^(2m),  a(0) = 1
Divide by 8^m to get
   a(m)/8^m = a(m-1)/8^(m-1) + (1/2)^m
Change variables again to b(m) = a(m)/8^m:
   b(m) = b(m-1) + (1/2)^m,  b(0) = a(0)/8^0 = 1
        = sum_{k=1 to m} (1/2)^k + b(0),   a simple geometric sum.
(Runnable sketches of both the blocked algorithm and RMM appear below.)
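For concreteness, here is a minimal Python/numpy sketch of both algorithms just described (not from the original notes). The block size b and the recursion cutoff are arbitrary illustrative choices, not tuned values; a real implementation would be written at a much lower level, but the structure is the same.

   import numpy as np

   def blocked_matmul(A, B, C, b):
       """C = C + A*B with b x b blocks; assumes n is a multiple of b.
       b would be chosen so that 3*b^2 words fit in cache (3*b^2 <= M)."""
       n = A.shape[0]
       for i in range(0, n, b):
           for j in range(0, n, b):
               Cij = C[i:i+b, j:j+b].copy()       # "read C[i,j] into cache"
               for k in range(0, n, b):
                   Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]   # b x b matmul in cache
               C[i:i+b, j:j+b] = Cij              # "write C[i,j] back"
       return C

   def rmm(A, B):
       """Recursive (cache-oblivious) matrix multiply; n a power of 2."""
       n = A.shape[0]
       if n <= 32:                # illustrative cutoff; in exact arithmetic
           return A @ B           # any cutoff gives the same answer
       h = n // 2
       A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
       B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
       C = np.empty((n, n))
       C[:h, :h] = rmm(A11, B11) + rmm(A12, B21)
       C[:h, h:] = rmm(A11, B12) + rmm(A12, B22)
       C[h:, :h] = rmm(A21, B11) + rmm(A22, B21)
       C[h:, h:] = rmm(A21, B12) + rmm(A22, B22)
       return C

   n = 128
   A, B = np.random.rand(n, n), np.random.rand(n, n)
   print(np.allclose(blocked_matmul(A, B, np.zeros((n, n)), 32), A @ B))
   print(np.allclose(rmm(A, B), A @ B))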
Let W(n) = #words moved between slow memory and a cache of size M:
   W(n) = 8*W(n/2)     ... 8 recursive calls
          + 12*(n/2)^2 ... up to 8 reads and 4 writes of n/2 x n/2 blocks
        = 8*W(n/2) + 3*n^2
Oops - this looks bigger than A(n), rather than A(n)/sqrt(M). But what is the base case? It is not W(1): the recursion bottoms out as soon as a subproblem fits in cache, i.e.
   W(b) = 3*b^2 when 3*b^2 <= M, i.e. b = sqrt(M/3).
Solve as before: change variables from n = 2^m to m:
   w(m) = 8*w(m-1) + 3*2^(2m),  w(log2 sqrt(M/3)) = M
Let mbar = log2 sqrt(M/3). Divide by 8^m and let v(m) = w(m)/8^m:
   v(m) = v(m-1) + 3*(1/2)^m,  v(mbar) = M/8^mbar = M/(M/3)^(3/2) = 3^(3/2)/M^(1/2)
This is again a geometric sum, but only down to m = mbar, not to 1:
   v(m) = sum_{k=mbar+1 to m} 3*(1/2)^k + v(mbar)
       <= 3*(1/2)^mbar + v(mbar)
        = 3/sqrt(M/3) + 3^(3/2)/sqrt(M)
        = 2*3^(3/2)/sqrt(M)
so W(n) = w(log2 n) = 8^(log2 n)*v(log2 n) <= 2*3^(3/2)*n^3/sqrt(M), as desired.

We mention briefly that the lower bound extends to the parallel case in a natural way: now the "cache" is the memory local to each processor, which it can access quickly, and "main memory" is the memory local to all the other processors, which is much slower to access. We assume each processor is assigned an equal fraction of the flops to do, 2*n^3/P where P is the number of processors, and an equal fraction of the memory required to store all the data, so M = 3*n^2/P. Then the same proof as above says that for each processor to perform its 2*n^3/P flops, it needs to move Omega((n^3/P)/sqrt(M)) = Omega(n^2/sqrt(P)) words into and out of its local memory. This lower bound is indeed attained by known algorithms, such as SUMMA, discussed in more detail in CS267.

One more topic: we can go asymptotically faster than n^3, not just to multiply matrices, but for any linear algebra problem. There are many such algorithms; we discuss only the first one, which so far is the most practical one:

Strassen (1967): Matrix multiplication is possible in O(n^2.81) = O(n^(log2 7)) operations. The algorithm is recursive, based on remarkable identities for multiplying 2 x 2 block matrices. Write
   A = [ A11 A12 ], each block of size n/2 x n/2, and similarly for B and C. Then
       [ A21 A22 ]
   P1 = (A12-A22)*(B21+B22)
   P2 = (A11+A22)*(B11+B22)
   P3 = (A11-A21)*(B11+B12)
   P4 = (A11+A12)*B22
   P5 = A11*(B12-B22)
   P6 = A22*(B21-B11)
   P7 = (A21+A22)*B11

   C11 = P1+P2-P4+P6
   C12 = P4+P5
   C21 = P6+P7
   C22 = P2-P3+P5-P7
This turns naturally into a recursive algorithm (see the runnable sketch after the analysis below):
   function C = Strassen(A,B)
   ... assume for simplicity that A and B are square of size n
   ... assume for simplicity that n is a power of 2
   if n = 1
      C = A*B   ... scalar multiplication
   else
      P1 = Strassen(A12-A22,B21+B22)
      ... and 6 more similar lines
      C11 = P1+P2-P4+P6
      ... and 3 more similar lines
   end
Let A(n) = #arithmetic operations:
   A(n) = 7*A(n/2)      ... 7 recursive calls
          + 18*(n/2)^2  ... 18 additions of n/2 x n/2 matrices
We solve as before: change variables from n = 2^m to m, and A(n) to a(m):
   a(m) = 7*a(m-1) + (9/2)*2^(2m),  a(0) = A(1) = 1
Divide by 7^m and change variables again to b(m) = a(m)/7^m:
   b(m) = b(m-1) + (9/2)*(4/7)^m,  b(0) = a(0) = 1
so b(m) = O(1), a convergent geometric sum, and
   A(n) = a(log2 n) = b(log2 n)*7^(log2 n) = O(7^(log2 n)) = O(n^(log2 7)).
What about #words moved? Again we get a similar recurrence:
   W(n) = 7*W(n/2) + O(n^2),
with base case W(nbar) = O(nbar^2) when nbar^2 = O(M), i.e. when the whole problem fits in cache. The solution is O(n^x/M^(x/2 - 1)) where x = log2 7.
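Here is a minimal Python/numpy sketch of Strassen's recursion as written above (not from the original notes). The cutoff at which it falls back on ordinary multiplication is an arbitrary illustrative choice (in practice it would be tuned), and the check at the end compares against the classical product.

   import numpy as np

   def strassen(A, B):
       """Strassen's algorithm for n x n matrices, n a power of 2."""
       n = A.shape[0]
       if n <= 64:                # illustrative cutoff: fall back to ordinary matmul
           return A @ B
       h = n // 2
       A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
       B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
       P1 = strassen(A12 - A22, B21 + B22)
       P2 = strassen(A11 + A22, B11 + B22)
       P3 = strassen(A11 - A21, B11 + B12)
       P4 = strassen(A11 + A12, B22)
       P5 = strassen(A11, B12 - B22)
       P6 = strassen(A22, B21 - B11)
       P7 = strassen(A21 + A22, B11)
       C = np.empty((n, n))
       C[:h, :h] = P1 + P2 - P4 + P6
       C[:h, h:] = P4 + P5
       C[h:, :h] = P6 + P7
       C[h:, h:] = P2 - P3 + P5 - P7
       return C

   n = 256
   A, B = np.random.rand(n, n), np.random.rand(n, n)
   print(np.allclose(strassen(A, B), A @ B))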
Note that plugging in x = 3 gives the lower bound for standard matmul. Again, there is a theorem (with a very different proof) showing that this algorithm is optimal:

Thm (D., Ballard, Holtz, Schwartz, 2010): Omega(n^x/M^(x/2 - 1)) is a lower bound on #words_moved for Strassen and for algorithms "enough like" Strassen, where x = log2 7 for Strassen, and x is whatever the exponent is for "Strassen-like" methods (lots of open questions).

Strassen is not (yet) much used in practice, but it can pay off for n not too large (a few hundred). Its error analysis is slightly worse than that of the usual algorithm:
   Usual matmul (Q1.10):  |fl(A*B) - A*B| <= n*eps*|A|*|B|        (componentwise)
   Strassen:              ||fl(A*B) - A*B|| <= O(eps)*||A||*||B||  (normwise)
which is good enough for lots of purposes. But when can Strassen's bound be much worse? (Consider what happens when one row of A, or one column of B, is very tiny.) Strassen is not allowed to be used in the "LINPACK Benchmark" (www.top500.org), which ranks machines by speed = (2/3)*n^3 / time_for_GEPP(n). If you use Strassen (we'll see how) inside GEPP, it does far fewer than (2/3)*n^3 flops, so the computed "speed" could exceed the actual peak machine speed. As a result, vendors stopped optimizing Strassen, even though it's a good idea!

The current world's record is Le Gall (2014): O(n^2.3728639). But it is impractical: it would need a huge n to pay off. All such algorithms were shown to be numerically stable (in an asymptotic norm-sense) by D., Dumitriu, Holtz, Kleinberg (2007). Given any fast matmul running in O(n^w) operations, it is also possible to do the rest of linear algebra (solve Ax=b, least squares, eigenproblems, the SVD) in O(n^(w+eta)) operations for any eta > 0, and to do so numerically stably, as shown by D., Dumitriu, Holtz (2007).

A similar divide-and-conquer idea can be used to make complex matrix multiplication a constant factor cheaper than you might think. Normally, to multiply two complex numbers or matrices, you use the formula
   (A+i*B)*(C+i*D) = (A*C - B*D) + i*(A*D + B*C),
costing 4 real multiplications and 2 real additions. Here is another way:
   T1 = A*C
   T2 = B*D
   T3 = (A+B)*(C+D)
Then it turns out that
   (A+i*B)*(C+i*D) = (T1 - T2) + i*(T3 - T1 - T2),
for a cost of 3 real multiplications and 5 real additions. Since matrix multiplication costs O(n^3) (or O(n^x) for some x > 2) while matrix addition costs only O(n^2), for large n the cost drops by a factor of 3/4. The error analysis is very similar to that of the usual algorithm. (A small numerical check of this formula appears below.) Applying the above formula recursively in a different context yields an algorithm for multiplying two n-bit integers in O(n^(log2 3)) = O(n^1.59) bit operations, as opposed to the conventional O(n^2). This topic is taught in classes like CS170.
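Here is a small numpy sketch (not from the original notes) checking the 3-multiplication formula against the ordinary complex product; the matrices and their size are arbitrary.

   import numpy as np

   n = 200
   A, B = np.random.rand(n, n), np.random.rand(n, n)   # real and imaginary parts of the first factor
   C, D = np.random.rand(n, n), np.random.rand(n, n)   # real and imaginary parts of the second factor

   # Ordinary formula: 4 real matrix multiplications
   direct = (A + 1j*B) @ (C + 1j*D)

   # 3-multiplication formula
   T1 = A @ C
   T2 = B @ D
   T3 = (A + B) @ (C + D)
   fast = (T1 - T2) + 1j*(T3 - T1 - T2)

   print(np.allclose(direct, fast))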