Notes for Math 221, Lecture 6, Sep 14 2010

Goal: Explore different ways to multiply matrices as fast as possible.

Why matmul? It is the easiest algorithm to understand, and (nearly) all the ideas extend
to the other algorithms we study (which we won't have time to explain in as much detail).

How could there be different algorithms that run at different speeds?
(1) Two algorithms may do the same number of flops (say 2n^3 for matmul) but move
    different amounts of data between parts of the computer memory, which dominates
    the time; we want the algorithm that moves as little data as possible.
(2) On a parallel computer with P processors, we need to give each processor (1/P)-th
    of the work, again without moving too much data between processors.
(3) Some matmul algorithms use O(n^3) flops, some O(n^2.81), some less... (the world
    record is currently O(n^2.376), but it is not practical, because the constant
    hidden in the O(.) is enormous).

We want a simple mathematical model of what it costs to move data, so we can analyze
algorithms easily and identify the one that is fastest. So we need to define two terms:
bandwidth and latency.

First, to establish intuition, consider the time it takes to move cars on the freeway
from Berkeley to Sacramento:
  The bandwidth measures how many cars/hour can get from Berkeley to Sacramento:
    #cars/hour = density (#cars/mile in a lane) x velocity (miles/hour) x #lanes
  The latency measures how long it takes for one car to get from Berkeley to Sacramento:
    time (hours) = distance (miles) / velocity (miles/hour)
So the minimum time it takes n cars to go from Berkeley to Sacramento is when they all
travel in a single "convoy," that is, all as close together as possible given the
density, and using all lanes:
  time (hours) = time for first car to arrive + time for remaining cars to arrive
               = latency + n/bandwidth

Now consider the similar situation of reading bits of data from disk, in which the read
head must move toward or away from the center of the disk to line up with the circle on
which the data lies, then wait for the disk to spin around so the data is underneath the
read head, and then read the data:
  The bandwidth measures how many bits/second can be read:
    #bits/second = density (#bits/cm) x velocity (cm/sec) x #read_heads
  The latency measures the (average) time to read one bit:
    time (seconds) = average distance to move read head (cm) / velocity (cm/sec)
                   + average distance to spin disk (.5 revolution) / velocity (revolutions/sec)
So the minimum time to read n bits from disk is when they are all packed together on one
circle, and equals latency + n/bandwidth (on average).

The same idea applies (though the physics is harder to explain) to reading bits from main
memory to cache (draw picture): The data resides in main memory initially, but one can
only perform arithmetic and other operations on it after moving it to the much smaller
cache. The time to read or write w words of data stored in contiguous memory locations
(which we call a "message" instead of a convoy) is
  time = latency + w/bandwidth
More generally, to read or write w words stored in m separate contiguous messages costs
  m*latency + w/bandwidth
Notation: We will write this cost as m*alpha + w*beta, and refer to it as the cost of
"communication". We refer to w as #words_moved and m as #messages.
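A minimal sketch of this cost model in Python; the values of alpha and beta below are
invented purely for illustration (they are not measurements of any particular machine),
but they show why moving the same number of words in fewer messages is so much cheaper:

    # hypothetical machine parameters, for illustration only
    alpha = 1e-6   # latency per message (seconds) -- assumed value
    beta  = 1e-9   # time per word moved (seconds) -- assumed value

    def comm_time(m, w):
        """Model of the cost of moving w words in m contiguous messages."""
        return m * alpha + w * beta

    # moving 10^6 words as one message vs. one word per message
    print(comm_time(1, 10**6))       # ~1.0e-3 seconds
    print(comm_time(10**6, 10**6))   # ~1.0 second, dominated by latency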
We also let gamma = time per flop, so if an algorithm does f flops our estimate of the
total running time will be
  time = m*alpha + w*beta + f*gamma
We claim that
(1) gamma << beta << alpha on modern machines (by factors of 100s for memory, and even
    more for disk or for sending messages between processors), and
(2) it is only getting worse over time: gamma is improving ~60%/year, beta for memory
    about 23%/year, alpha for memory about 7%/year.
So when we design or choose algorithms, we need to pay attention to their communication
costs.

How do we tell if we have designed the algorithm well with respect to not communicating
too much? If the f*gamma term in the time dominates the communication cost m*alpha + w*beta
(i.e. f*gamma >= m*alpha + w*beta), then the algorithm is at most 2x slower than if
communication were free. When f*gamma >> m*alpha + w*beta, the algorithm is running at
near peak arithmetic speed, and can't go faster. When f*gamma << m*alpha + w*beta,
communication costs dominate.
Notation: computational intensity = q = f/w = "flops per word_moved". This needs to be
large to go fast, since f*gamma >= w*beta means q = f/w >= beta/gamma >> 1.

Now we can state a sequence of questions we will ask, and answer, about n-by-n dense
matrix multiplication C = A*B. (The same questions will be asked of all the other
algorithms of the course; more on that later.)

(1) What is the simplest (sequential) algorithm, and how much does it cost in f = #flops?
    Assuming the data initially resides in main memory, that any data that participates
    in arithmetic has to be moved to a smaller cache of size M, and that we want the
    final answer to end up back in main memory, how many words w does this simplest
    algorithm move between main memory and cache? Is this algorithm fast enough?
    Answer: The usual algorithm consists of 3 nested loops and does 2n^3 flops
    (a code sketch of this loop nest appears after question (3) below):
      for i = 1 to n
        for j = 1 to n
          for k = 1 to n
            C(i,j) += A(i,k)*B(k,j)
    There are two cases: when all 3 matrices fit in cache, and when they don't.
    If the 3 matrices do fit in cache (3*n^2 <= M), then this algorithm will bring each
    entry A(i,k) and B(k,j) into cache the first time it is needed, finish computing C
    without any more need to access memory, and then write C back to memory, for a total
    of 2n^2 reads from and n^2 writes to memory, or w = 3n^2 in all. Clearly this is
    minimal. But if the matrices do not all fit, i.e. 3n^2 is at least several times
    larger than M, then about n^3 words are moved. This algorithm is not fast, because
    the time to run,
      time = 2n^3*gamma + n^3*beta,
    is dominated by communication: n^3*beta >> 2*n^3*gamma, even ignoring latency.
    Another way to say the same thing is to note that the computational intensity is
    only q = f/w = 2.

(2) Is there an algorithm that moves fewer words? How many?
    Answer: There are (at least) 2: "blocked" matrix multiply and "recursive" matrix
    multiply, to be presented later. When n^2 is larger than M, either algorithm moves
    just w = O(n^3/sqrt(M)) words.
    Ex: An Intel Nehalem processor has an M = 32 KByte = 4K double-precision-word
    level-1 (L1) cache (the smallest and fastest of several caches). Then
    sqrt(M) = sqrt(4K) = 64, so the cost of communication (#words_moved) is reduced by
    a factor of 64 by these algorithms. And these algorithms start paying off when your
    matrices are about 64 x 64 or larger.

(3) Can we do better?
    Answer: No:
    Theorem (Hong/Kung, 1981; Irony/Tiskin/Toledo, 2004): #words_moved = Omega(n^3/sqrt(M))
    is a lower bound for any algorithm that does the usual 2n^3 flops to perform matmul.
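Here is the promised sketch of the simplest algorithm from question (1), written as a
minimal Python function; numpy is used only for array storage, and the loops are meant to
mirror the pseudocode, not to be fast:

    import numpy as np

    def matmul_naive(A, B):
        """Simplest 3-nested-loop matmul: 2*n^3 flops, computational intensity
        q ~ 2 once the matrices no longer fit in cache."""
        n = A.shape[0]
        C = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[i, j] += A[i, k] * B[k, j]
        return C

    # sanity check against the library routine
    A = np.random.default_rng(0).standard_normal((4, 4))
    B = np.random.default_rng(1).standard_normal((4, 4))
    assert np.allclose(matmul_naive(A, B), A @ B)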
(4) What if my computer has multiple levels of cache, as all real computers do?
    (picture, office analogy)
    Answer: We can apply the same ideas recursively, and minimize the #words_moved
    between all levels of memory.
    Ex: The same Intel Nehalem has a level-2 (L2) cache of size M = 256 KB = 32K
    double-precision words, so sqrt(M) = 181, so it is possible to reduce #words_moved
    between main memory and L2 by a factor of about 181x, when the matrices are about
    181 x 181 or larger.

(5) Do these algorithms also minimize the number of messages?
    Answer: Not unless you use a different data structure to store your matrices in
    memory, storing them as blocks instead of the usual rowwise format (as in C) or
    columnwise format (as in Fortran): you need to use a blocked data structure (or a
    recursively blocked one, if there are multiple levels of cache). If you use the
    right data structure, you can attain the lower bound
      #messages = Omega(n^3/M^(3/2)).

(6) How fast can we hope that parallel matmul could run on P processors?
    Answer: Each processor should compute as little as possible (2n^3/P flops, if the
    2n^3 flops are divided evenly) and communicate as little as possible:
    Theorem: #words_moved = Omega(n^2/sqrt(P)) and #messages = Omega(sqrt(P)) are lower
    bounds on the communication costs. So
      time >= (2n^3/P)*gamma + c1*(n^2/sqrt(P))*beta + c2*sqrt(P)*alpha
    for some constants c1 and c2.

(7) Is there a parallel matmul algorithm that runs this fast?
    Answer: Yes: Cannon's algorithm, to be discussed later.

(8) Can we do matmul in fewer than 2n^3 flops? And keep it numerically stable? Are these
    algorithms worth using?
    Answer: Yes: Strassen's method takes O(n^(log2 7)) ~ O(n^2.81) flops. The world's
    record (Coppersmith/Winograd) is O(n^2.376), but it is impractical. None of these
    algorithms is quite as stable as the usual 2n^3 algorithm, but they are good enough
    (or can be modified to be good enough). Strassen's method is the only one where the
    constant hidden in the O(.) is small enough to be practical. Alas, the rules for the
    Linpack Benchmark for ranking the 500 fastest computers in the world (see
    www.top500.org), which measures how fast computers can do Gaussian elimination on a
    large dense matrix, do not permit Strassen to be used, so relatively little effort
    goes into tuning Strassen's method to run as fast as possible, compared to the usual
    2n^3 algorithm.

(9) Do we know how to minimize #words_moved for algorithms like Strassen?
    Answer: We think so (paper submitted, still many open problems).

(10) Is this all I need to know to write a fast matmul algorithm?
     Answer: No, there are lots of computer-specific details to get right (see the graphs at
       www.cs.berkeley.edu/~volkov/cs267.sp09/hw1
       www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results
     for evidence). These details are not a topic of this course. If you are interested,
     take CS267 next semester.

(11) So are there good implementations I can just use, and not worry about the details?
     Answer: Yes: matrix multiply especially has been tuned for most computers by their
     vendors and is available in their math libraries.
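In practice "just use the library" usually means a single call. A minimal Python sketch:
numpy's @ operator dispatches to whatever optimized BLAS numpy was linked against (MKL,
OpenBLAS, ...; which one you get is an assumption about your installation, not something
this sketch controls):

    import time
    import numpy as np

    n = 1024
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))

    t0 = time.perf_counter()
    C = A @ B                      # calls the tuned matmul in the underlying BLAS
    t1 = time.perf_counter()

    # rough effective flop rate, counting the usual 2*n^3 flops
    print(f"{2 * n**3 / (t1 - t0) / 1e9:.1f} Gflop/s")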
We won't have time to answer all 11 questions in detail for every linear algebra problem,
such as solving linear systems, least squares, eigenvalue problems, the SVD, etc., but we
will try to sketch what is known. The answers are all roughly the same (papers submitted
and in progress), but there are some important differences and extensions:
(1) The lower bound applies to all algorithms, for dense or sparse matrices, sequential
    or parallel. For sparse matrices, the numerator n^3 needs to be replaced by the
    #flops actually performed, which is usually much smaller than n^3, so that
      #words_moved = Omega(#flops/sqrt(M)) and #messages = Omega(#flops/M^(3/2)).
(2) There are algorithms for most problems that attain these communication lower bounds
    for dense matrices. A few of them, like matmul, do the same operations as the
    conventional algorithm but just in a different order, but most of them use
    mathematically different algorithms than their conventional counterparts. Some of
    these mathematically different algorithms do about the same number of flops as their
    conventional counterparts, but others are quite different, and do several times as
    many flops (though still O(n^3)). Finding algorithms for all linear algebra problems
    that minimize communication but only do a little more arithmetic is an open problem.
(3) Computer vendors often have tuned dense linear algebra libraries in their math
    libraries, typically versions of LAPACK and sometimes ScaLAPACK. For example, Intel
    has a special team for tuning its math libraries. However, the algorithms that
    minimize communication are too recent to be in these libraries, and we have ongoing
    projects to incorporate them. Prototypes of the new algorithms are significantly
    faster in some cases. Implementing some of these could constitute class projects.
    Such projects could be oriented toward getting the fastest implementation; others
    could just implement the algorithms in a language like Matlab and explore their
    numerical properties without worrying about performance, since some may have
    interestingly different numerical properties than their conventional counterparts.
(4) For sparse matrices, very few algorithms are known that attain the communication
    lower bounds. There are many open questions.

Now we go back to the beginning of the list of questions, and consider how to implement
matmul to minimize data movement. First we look at the naive implementation and what is
wrong with it, then at a good one:

  ... multiply C = C + A*B
  for i = 1:n
    for j = 1:n
      for k = 1:n
        C(i,j) = C(i,j) + A(i,k)*B(k,j)

We need to "decorate" the code with explicit descriptions of data movement (we assume we
control it, for simplicity, even though the hardware makes the decisions about moving
data in real caches). The answer will depend on the size of fast memory, call it M.

Easy case: A, B and C all fit in fast memory (i.e. 3*n^2 <= M):

  ... load all of A, B, C into fast memory
  multiply them using above algorithm
  ... store answers

  #words_moved = 4*n^2 (= lower bound = #inputs + #outputs)

Interesting case: when the matrices don't all fit (3*n^2 > M).
Simplest way to do it: load data just before you need it:

  for i = 1:n
    for j = 1:n
      ... load C(i,j)
      for k = 1:n
        ... load A(i,k), B(k,j)
        C(i,j) = C(i,j) + A(i,k)*B(k,j)
      ... store C(i,j)

  #words_moved = 2n^3 + n^2, so the computational intensity is
  q = 2n^3/(2n^3 + n^2) ~ 1, not large enough to be fast (we need something like beta/gamma).

A little better, but not much: assume a row of A and a row of C both fit in fast memory
(2*n <= M):

  for i = 1:n
    ... load whole rows A(i,:), C(i,:)
    for j = 1:n
      for k = 1:n
        ... load B(k,j)
        C(i,j) = C(i,j) + A(i,k)*B(k,j)
    ... store row C(i,:)

  #words_moved = n^3 + 3*n^2, computational intensity ~ 2, still not large enough to go fast.

Notation: think of A, B, C as collections of b x b subblocks, so that C[i,j] is b x b,
numbered from C[1,1] to C[n/b,n/b] (picture):

  for i = 1:n/b
    for j = 1:n/b
      for k = 1:n/b
        C[i,j] = C[i,j] + A[i,k]*B[k,j]   ... b x b matrix multiplication
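A minimal Python sketch of the blocked loop nest above, assuming for simplicity that the
block size b divides n. The explicit loads and stores analyzed next are implicit here: a
hardware cache can keep the three current b x b blocks resident whenever 3*b^2 <= M.

    import numpy as np

    def blocked_matmul(A, B, C, b):
        """Blocked matmul, C += A*B, using b x b blocks; assumes b divides n."""
        n = A.shape[0]
        for i in range(0, n, b):
            for j in range(0, n, b):
                Cij = C[i:i+b, j:j+b]          # "load" block C[i,j] (a view into C)
                for k in range(0, n, b):
                    # "load" blocks A[i,k], B[k,j]; b x b matrix multiplication
                    Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
                # Cij is a view, so the updated block is already "stored" back in C

    # sanity check with hypothetical sizes chosen for illustration
    n, b = 8, 4
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    C = np.zeros((n, n))
    blocked_matmul(A, B, C, b)
    assert np.allclose(C, A @ B)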
Suppose that 3 b x b blocks fit in fast memory (3*b^2 <= M); then we can move data as follows:

  for i = 1:n/b
    for j = 1:n/b
      ... load block C[i,j]
      for k = 1:n/b
        ... load blocks A[i,k] and B[k,j]
        C[i,j] = C[i,j] + A[i,k]*B[k,j]   ... b x b matrix multiplication
      ... store block C[i,j]

  #words_moved = 2*(n/b)^2 * b^2   ... for C
               + 2*(n/b)^3 * b^2   ... for A and B
               = 2*n^2 + 2*n^3/b

Much better: the dominant term is smaller by a factor of b, and the
computational intensity is q = 2*n^3/(2*n^2 + 2*n^3/b) ~ b.
We want b as large as possible to minimize #words_moved. How large can we make it?
We need 3*b^2 <= M, so
  #words_moved >= 2*n^2 + 2*sqrt(3)*n^3/sqrt(M) = Omega(n^3/sqrt(M))
and the computational intensity is q = O(sqrt(M)).

Theorem: No matter how we reorganize matmul, as long as we do the usual 2*n^3 arithmetic
operations, but possibly just doing the sums in different orders, then
  #words_moved = Omega(n^3/sqrt(M)).
History: Hong, Kung, 1981
         Irony, Tiskin, Toledo, 2004 (parallel case too)
         Ballard, D., Holtz, Schwartz, 2009 (rest of linear algebra)

Proof (ITT): Think of your matmul program as doing some sequence of instructions:
add, mul, load, store, ... Arithmetic like add and mul only uses data in fast memory;
load and store move data between fast and slow memory.
Now just look at all the load/store instructions, and say there are W (= #words_moved)
of them. We want a lower bound on W.
  Let S(1) be the set of all instructions from the start up to the M-th load/store instruction.
  Let S(2) be the set of all instructions after S(1), up to the 2M-th load/store instruction.
  ...
  Let S(i) be the set of all instructions after S(i-1), up to the i*M-th load/store instruction.
  ...
  Let S(r) be the last set of instructions (which may have fewer than M load/store
  instructions, if M does not divide W evenly).
Now (r-1)*M <= W, so a lower bound on r gives a lower bound on W.
Suppose we had an upper bound F on the number of adds & muls that could be in any S(t).
Since we need to do 2*n^3 adds & muls, that tells us that r*F >= 2*n^3, or r >= 2*n^3/F, so
  W >= (r-1)*M >= (2*n^3/F - 1)*M
is a lower bound.
How do we get an upper bound F? We need to ask what adds and muls we can possibly do in
one set S(t). We will bound this by using the fact that during S(t), only a limited
amount of data is available in fast memory on which to do arithmetic:
  There are the M words in fast memory when S(t) starts, e.g. up to M different entries
    of A, B, and partially computed entries of C.
  We can do as many as M more loads during S(t), e.g. up to M more entries of A, B, and
    partially computed entries of C.
  We can do as many as M stores during S(t), e.g. up to M more partially or fully
    computed entries of C.
Altogether, there are at most 3*M words of data on which to do adds & muls.
How many can we do? Clever idea of ITT: represent the possible data by 3 faces of an
n x n x n cube:
  each A(i,k) is represented by a 1x1 square on one face,
  each B(k,j) is represented by a 1x1 square on another, not parallel, face,
  each C(i,j) is represented by a 1x1 square on another, not parallel, face,
and each pair of operations C(i,j) = C(i,j) + A(i,k)*B(k,j) is represented by the 1x1x1
cube with the corresponding projections on each face.
Counting problem: we know the total number of squares on each face that are "occupied"
is at most 3*M. How many cubes can these "cover"? (Picture)
Easy case: the occupied C(i,j) fill up an X x Y rectangle, the occupied A(i,k) fill up an
X x Z rectangle, and the occupied B(k,j) fill up a Z x Y rectangle; then the "covered"
cubes occupy a brick of dimension X x Y x Z, so there are at most X*Y*Z of them.
General case: what if we don't have rectangles and bricks?
Note that the volume X*Y*Z can also be written as
  XYZ = sqrt((X*Y)*(X*Z)*(Z*Y)) = sqrt((#C entries)*(#A entries)*(#B entries)).

Lemma (Loomis & Whitney, 1949): Let S be any bounded 3D set. Let S_x be the projection
("shadow") of S on the y,z-plane (i.e. if (x,y,z) is in S then (y,z) is in S_x).
Similarly, let S_y and S_z be the shadows on the other planes. Then
  Volume(S) <= sqrt(area(S_x)*area(S_y)*area(S_z)).

Apply this to our case:
  S   = set of 1x1x1 cubes representing all mul/add pairs
      = {cubes centered at (i,j,k) representing C(i,j) = C(i,j) + A(i,k)*B(k,j)}
  S_x = set of entries of B needed to perform these
      = {squares centered at (k,j) projected from C(i,j) = C(i,j) + A(i,k)*B(k,j)}
  S_y = set of entries of A needed to perform them
      = {squares centered at (i,k) projected from C(i,j) = C(i,j) + A(i,k)*B(k,j)}
  S_z = set of entries of C needed to perform them
      = {squares centered at (i,j) projected from C(i,j) = C(i,j) + A(i,k)*B(k,j)}
The Lemma gives
  volume(S) = #cubes in S = #flops/2
           <= sqrt(area(S_x)*area(S_y)*area(S_z))
            = sqrt((#C entries)*(#A entries)*(#B entries))
           <= sqrt((3*M)*(3*M)*(3*M))
or #flops = F <= 6*sqrt(3)*M^(1.5). Finally,
  #words_moved = W >= (r-1)*M >= (2*n^3/F - 1)*M
              >= (2*n^3/(6*sqrt(3)*M^(1.5)) - 1)*M
               = (1/sqrt(27))*n^3/sqrt(M) - M
               = Omega(n^3/sqrt(M))
as desired. This completes the proof of the lower bound.

Here is another algorithm that also attains the lower bound, using a pattern that will
appear later: recursion.

  function C = RMM(A,B)
    ... recursive matrix multiplication
    ... assume for simplicity that A and B are square of size n
    ... assume for simplicity that n is a power of 2
    if n = 1
      C = A*B   ... scalar multiplication
    else
      ... write A = [ A11 A12 ], where each Aij is n/2 by n/2
      ...           [ A21 A22 ]
      ... write B and C similarly
      C11 = RMM(A11,B11) + RMM(A12,B21)
      C12 = RMM(A11,B12) + RMM(A12,B22)
      C21 = RMM(A21,B11) + RMM(A22,B21)
      C22 = RMM(A21,B12) + RMM(A22,B22)
    endif
    return

Correctness follows by induction on n.
Cost: let A(n) = #arithmetic operations on matrices of dimension n:
  A(n) = 8*A(n/2)   ... 8 recursive calls
       + n^2        ... 4 additions of n/2 x n/2 matrices
and A(1) = 1 is the base case.
Solve by changing variables from n = 2^m to m:
  a(m) = 8*a(m-1) + 2^(2m), a(0) = 1
Divide by 8^m to get a(m)/8^m = a(m-1)/8^(m-1) + (1/2)^m, and change variables again to
b(m) = a(m)/8^m:
  b(m) = b(m-1) + (1/2)^m, b(0) = a(0)/8^0 = 1
so
  b(m) = sum_{k=1 to m} (1/2)^k + b(0)   ... a simple geometric sum
       = 2 - (1/2)^m -> 2
or A(n) = a(log2 n) = b(log2 n)*8^(log2 n) = b(log2 n)*n^3 -> 2n^3, as expected.
Ask: Why isn't it exactly 2n^3? A(n) = b(log2 n)*n^3 = (2 - 1/n)*n^3 = 2n^3 - n^2.

Let W(n) = #words moved between slow memory and a cache of size M:
  W(n) = 8*W(n/2)    ... 8 recursive calls
       + 12*(n/2)^2  ... up to 8 reads and 4 writes of n/2 x n/2 blocks
       = 8*W(n/2) + 3n^2
Oops - this looks bigger than A(n), rather than A(n)/sqrt(M). But what is the base case?
  W(b) = 3*b^2 if 3*b^2 <= M, i.e. b = sqrt(M/3)
The base case is not W(1) = something.
Solve as before: change variables from n = 2^m to m:
  w(m) = 8*w(m-1) + 3*2^(2m), w(log2 sqrt(M/3)) = M
Let mbar = log2 sqrt(M/3), and divide by 8^m to get v(m) = w(m)/8^m:
  v(m) = v(m-1) + 3*(1/2)^m, v(mbar) = M/8^mbar = M/(M/3)^(3/2) = 3^(3/2)/M^(1/2)
This is again a geometric sum, but just down to m = mbar, not to 1:
  v(m) = sum_{k=mbar+1 to m} 3*(1/2)^k + v(mbar)
      <= 3*(1/2)^mbar + v(mbar)
       = 3/sqrt(M/3) + 3^(3/2)/sqrt(M)
       = 2*3^(3/2)/sqrt(M)
so W(n) = w(log2 n) = 8^(log2 n)*v(log2 n) <= 2*3^(3/2)*n^3/sqrt(M), as desired.
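A minimal Python sketch of RMM above, assuming n is a power of 2. (In practice the
recursion would stop once a block fits in cache, around b = sqrt(M/3), rather than
recursing all the way down to 1 x 1; recursing to scalars here keeps the sketch faithful
to the pseudocode.)

    import numpy as np

    def rmm(A, B):
        """Recursive matrix multiply; assumes square matrices with n a power of 2."""
        n = A.shape[0]
        if n == 1:
            return A * B                       # 1x1 block: scalar multiplication
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        C = np.empty((n, n))
        C[:h, :h] = rmm(A11, B11) + rmm(A12, B21)
        C[:h, h:] = rmm(A11, B12) + rmm(A12, B22)
        C[h:, :h] = rmm(A21, B11) + rmm(A22, B21)
        C[h:, h:] = rmm(A21, B12) + rmm(A22, B22)
        return C

    # sanity check on a small power-of-2 size
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((16, 16)), rng.standard_normal((16, 16))
    assert np.allclose(rmm(A, B), A @ B)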
Here is another application of the theorem to the parallel case, where "load" means
receive data from another processor's memory, and "store" means send data to another
processor's memory. Suppose we have P processors working in parallel to multiply 2
matrices. We would like each processor to do 1/P-th of the arithmetic, i.e. 2n^3/P flops.
We want each processor to store at most 1/P-th of all the data (A, B and C), i.e.
M = 3*n^2/P. Then the same proof of our theorem says the number of load/stores is at least
  Omega((2*n^3/P) / sqrt(3*n^2/P)) = Omega(n^2/sqrt(P)).
Ask: Why does the proof work, even if a processor is only doing part of the work in matmul?

Is there an algorithm that attains this? Yes: Cannon's Algorithm
(picture of layout and processor grid) (picture of algorithm):

  assume s = sqrt(P) and n/s are integers
  let each A(i,j) be an n/s x n/s subblock, numbered from A(0,0) to A(s-1,s-1)
  (1) for all i = 0 to s-1, left-circular-shift row i of A by i (parallel)
      ... so A(i,j) is overwritten by A(i, i+j mod s)
  (2) for all j = 0 to s-1, up-circular-shift column j of B by j (parallel)
      ... so B(i,j) is overwritten by B(i+j mod s, j)
  for k = 0 to s-1 (sequential)
    for all i = 0 to s-1 and all j = 0 to s-1 (all processors in parallel)
      (3) C(i,j) = C(i,j) + A(i,j)*B(i,j)
      (4) left-circular-shift each row of A by 1
      (5) up-circular-shift each column of B by 1
      ... so at step k, C(i,j) = C(i,j) + A(i, i+j+k mod s)*B(i+j+k mod s, j)

(Show slides)
Proof of correctness: the inner loop adds all the terms C(i,j) = C(i,j) + A(i,k)*B(k,j),
just in a "rotated" order.
Arithmetic cost, from line (3): s*(2*(n/s)^3) = 2*n^3/s^2 = 2*n^3/P, so each processor
does 1/P-th of the work, as desired.
Communication cost (#words_moved):
  from line (1): s*(n/s)^2 = n^2/s words moved
  from line (2): same
  from line (4): s*(n/s)^2 = n^2/s words moved
  from line (5): same
  total = 4*n^2/s = 4*n^2/sqrt(P), meeting the lower bound.
Let's consider the more complicated timing model
  time = #flops*gamma + #words_moved*beta + #messages*alpha
where one message is a submatrix: from each of lines (1), (2), (4), (5) we get s messages,
so 4*s = 4*sqrt(P) altogether.
Since #words_moved <= #messages * maximum_message_size, a lower bound on #messages is
  #messages >= #words_moved / maximum_message_size.
Here we assume that each processor only stores O(1/P-th) of all the data, i.e. its memory
size is O(n^2/P). This is clearly an upper bound on the maximum_message_size too, so
  #messages >= #words_moved / O(n^2/P) = Omega(n^2/sqrt(P)) / O(n^2/P) = Omega(sqrt(P))
and we see that Cannon's algorithm also minimizes #messages (up to a constant factor).
Note: there are lots of assumptions (P a perfect square, s divides n, starting data in
the right place, double memory), so the practical algorithm (SUMMA, in the PBLAS
library!) is more complicated, and sends log(P) times as much data as Cannon.

One more matmul topic: we can go asymptotically faster than n^3.
Strassen (1967): O(n^2.81) = O(n^(log2 7)), recursive, based on remarkable identities
for multiplying 2 x 2 matrices.
Write A = [ A11 A12 ], each block of size n/2 x n/2, and the same for B and C:
          [ A21 A22 ]
  P1 = (A12-A22)*(B21+B22)
  P2 = (A11+A22)*(B11+B22)
  P3 = (A11-A21)*(B11+B12)
  P4 = (A11+A12)*B22
  P5 = A11*(B12-B22)
  P6 = A22*(B21-B11)
  P7 = (A21+A22)*B11
  C11 = P1+P2-P4+P6
  C12 = P4+P5
  C21 = P6+P7
  C22 = P2-P3+P5-P7
This changes naturally into a recursive algorithm:

  function C = Strassen(A,B)
    ... assume for simplicity that A and B are square of size n
    ... assume for simplicity that n is a power of 2
    if n = 1
      C = A*B   ... scalar multiplication
    else
      P1 = Strassen(A12-A22,B21+B22)
      ... and 6 more similar lines
      C11 = P1+P2-P4+P6
      ... and 3 more similar lines
    end
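A minimal Python sketch of Strassen's recursion, with the elided "6 more similar lines"
and "3 more similar lines" written out from the identities above. The cutoff n_min, below
which it falls back to ordinary multiplication, is an arbitrary illustrative choice; in
practice it would be tuned.

    import numpy as np

    def strassen(A, B, n_min=64):
        """Strassen's recursion; assumes square matrices with n a power of 2.
        Below the (hypothetical) cutoff n_min, fall back to ordinary matmul."""
        n = A.shape[0]
        if n <= n_min:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        P1 = strassen(A12 - A22, B21 + B22, n_min)
        P2 = strassen(A11 + A22, B11 + B22, n_min)
        P3 = strassen(A11 - A21, B11 + B12, n_min)
        P4 = strassen(A11 + A12, B22, n_min)
        P5 = strassen(A11, B12 - B22, n_min)
        P6 = strassen(A22, B21 - B11, n_min)
        P7 = strassen(A21 + A22, B11, n_min)
        C = np.empty((n, n))
        C[:h, :h] = P1 + P2 - P4 + P6
        C[:h, h:] = P4 + P5
        C[h:, :h] = P6 + P7
        C[h:, h:] = P2 - P3 + P5 - P7
        return C

    # sanity check (n a power of 2)
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
    assert np.allclose(strassen(A, B), A @ B)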
The arithmetic cost is
  A(n) = #arithmetic operations
       = 7*A(n/2)      ... 7 recursive calls
       + 18*(n/2)^2    ... 18 additions of n/2 x n/2 matrices
Solve as before: change variables from n = 2^m to m, and A(n) to a(m):
  a(m) = 7*a(m-1) + (9/2)*2^(2m), a(0) = A(1) = 1
Divide by 7^m and change again to b(m) = a(m)/7^m:
  b(m) = b(m-1) + (9/2)*(4/7)^m, b(0) = a(0) = 1
so b(m) = 1 + (9/2)*sum_{k=1 to m} (4/7)^k = O(1)   ... a convergent geometric sum
and A(n) = a(log2 n) = b(log2 n)*7^(log2 n) = O(7^(log2 n)) = O(n^(log2 7)).

What about #words_moved? Again we get a similar recurrence:
  W(n) = 7*W(n/2) + O(n^2),
with base case W(nbar) = O(nbar^2) when nbar^2 = O(M), i.e. when the whole problem fits
in cache. The solution is O(n^x/M^(x/2 - 1)) where x = log2 7. Note that plugging in
x = 3 gives the lower bound for standard matmul.
Thm (D., Ballard, Holtz, Schwartz, 2010): Omega(n^x/M^(x/2 - 1)) is a lower bound on
#words_moved for Strassen and for algorithms "enough like" Strassen (paper in review,
lots of open questions).

Strassen is not (yet) much used in practice:
  It can pay off for n not too large (a few hundred).
  Its error analysis is slightly worse than for the usual algorithm:
    Usual matmul (Q1.10): |fl(A*B) - (A*B)| <= n*eps*|A|*|B|
    Strassen:             ||fl(A*B) - (A*B)|| <= O(eps)*||A||*||B||
  which is good enough for lots of purposes, but when can Strassen's be much worse?
  (Suppose one row of A, or one column of B, is very tiny.)
  It is not allowed to be used in the "LINPACK Benchmark" (www.top500.org), which ranks
  machines by speed = (2/3)*n^3 / time_for_GEPP(n); if you use Strassen (we'll see how)
  in GEPP, it does << (2/3)*n^3 flops, so the computed "speed" could exceed the actual
  peak machine speed. As a result, vendors stopped optimizing Strassen, even though it's
  a good idea!

The current world's record, Coppersmith & Winograd (1990), is O(n^2.376); it is
impractical, since it would need a huge n to pay off. It was rederived by Cohn, Umans,
Kleinberg, Szegedy (2003, 2005) using group representation theory. All such algorithms
were shown to be numerically stable (in the norm sense) by D., Dumitriu, Holtz, Kleinberg
(2007). Given any fast matmul running in O(n^w) operations, it is possible to do the rest
of linear algebra (solve Ax=b, least squares, eigenproblems, the SVD) in O(n^(w+eta)) for
any eta > 0, and to do so numerically stably, by D., Dumitriu, Holtz (2007).

A similar divide-and-conquer idea can be used to make complex matrix multiplication a
constant factor cheaper than you might think. Normally, to multiply two complex numbers
or matrices, you use the formula
  (A+i*B)*(C+i*D) = (A*C - B*D) + i*(A*D + B*C),
costing 4 real multiplications and 2 real additions. Here is another way:
  T1 = A*C
  T2 = B*D
  T3 = (A+B)*(C+D)
Then it turns out that
  (A+i*B)*(C+i*D) = (T1-T2) + i*(T3-T1-T2),
for a cost of 3 real multiplications and 5 real additions. But since matrix
multiplication costs O(n^3) (or O(n^x) for some x > 2) and addition costs O(n^2), for
large n the cost drops by a factor of 3/4. The error analysis is very similar to that of
the usual algorithm. Applying the above formula recursively in a different context yields
an algorithm for multiplying two n-bit integers in O(n^(log2 3)) = O(n^1.59) bit
operations, as opposed to the conventional O(n^2). This topic is taught in classes like
CS170.
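A minimal Python sketch of the 3-multiplication formula for complex matrices, checked
against the conventional 4-multiplication formula; the size and random test matrices are
arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200                          # arbitrary size for illustration
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    C, D = rng.standard_normal((n, n)), rng.standard_normal((n, n))

    # conventional formula: 4 real matrix multiplications
    conventional = (A @ C - B @ D) + 1j * (A @ D + B @ C)

    # 3-multiplication formula
    T1 = A @ C
    T2 = B @ D
    T3 = (A + B) @ (C + D)
    three_mult = (T1 - T2) + 1j * (T3 - T1 - T2)

    print(np.max(np.abs(three_mult - conventional)))   # tiny, on the order of roundoff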
Final issue, which applies to the parallel case of all the algorithms we discuss (again
we illustrate it just for matmul): If you are sitting at your machine and run your
program to multiply matrices twice, with identical inputs, does it necessarily give
identical outputs? How might it not? Why should anyone care?

Possible class project: experiment with different algorithms that necessarily always get
the same answer, and compare their performance and accuracy with implementations that do
not (i.e. are only optimized for speed).
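As a hint at how runs can differ: floating-point addition is not associative, so if a
parallel (or differently tuned) matmul sums the partial products A(i,k)*B(k,j) in a
different order, the rounded result can change. A tiny Python illustration (not from the
notes; the numbers are chosen only to make the effect visible):

    # floating-point addition is not associative
    print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
    print(0.1 + (0.2 + 0.3))   # 0.6

    # summing the same numbers in a different order gives a different result,
    # which is what happens when the reduction order changes (e.g. different #processors)
    xs = [1.0] + [1e-16] * 1000000
    print(sum(xs))             # 1.0 (each tiny term is absorbed by the 1.0)
    print(sum(reversed(xs)))   # about 1.0000000001 (the tiny terms accumulate first)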