Notes for Ma221 Lecture 1, Aug 29, 2014

Greetings!
  Class URL: www.cs.berkeley.edu/~demmel/ma221_Fall14
  Prereqs: good linear algebra; programming experience in Matlab (other languages
    are welcome, but homework assignments will be provided in Matlab); numerical
    sophistication at the level of Ma128ab is desirable
  Office Hours (subject to change): T 9-10, Th 1-2, F 1-2
  First homework is on the web page - due Monday Sep 8
  Reader - recruiting now!
  Enrollment: hopefully enough room for everyone
  No final exam; there will be a project instead
  Status of books in the bookstore - they should be there
  Fill out the survey - what we cover may depend on what you say you are interested in

To motivate the syllabus, we tell a story about a typical office hours meeting:

A student says: "I need to solve an n-by-n linear system Ax=b. What should I do?"
Professor replies: "The standard algorithm is Gaussian Elimination (GE), which costs
  (2/3)*n^3 floating point operations (flops)."
S: "That's too expensive."
P: "Tell me more about your problem."
S: "Well, the matrix is real and symmetric, A = A^T."
P: "Anything else?"
S: "Oh yes, it's positive definite: x^T*A*x > 0 for all nonzero vectors x."
P: "Great, you can use Cholesky; it costs only (1/3)*n^3 flops, half as much."
The professor also begins to record the conversation in a "decision tree", where each
node represents an algorithm and each edge a property of the matrix, with arrows
pointing to the next node/algorithm depending on the answer.
S: "That's still too expensive."
P: "Tell me more about your matrix."
S: "It has a lot of zeros in it; in fact it is all zeros once you are a distance
  n^(2/3) from the diagonal."
P: "Great, you have a band matrix with bandwidth bw = n^(2/3), so there is a version
  of Cholesky that costs only O(bw^2*n) = O(n^(7/3)) flops, much cheaper!"
S: "Still too expensive."
P: "So tell me more."
S: "I need to solve the problem over and over again, with the same A and different b.
  Should I just precompute inv(A) once and multiply by it?"
ASK&WAIT: Is this a good idea?
P: "inv(A) will be dense, so just multiplying by it costs 2*n^2 flops, but you can
  reuse the output of Cholesky (the banded L factor) to solve for each new b in just
  O(bw*n) = O(n^(5/3)) flops."
S: "That's still too expensive."
P: "Tell me more."
S: "There are actually a lot more zero entries: there are at most 7 nonzeros per row."
P: "Let's think about using an iterative method instead of a direct method; it only
  needs to multiply your matrix by a vector many times, updating an approximate
  answer until it is accurate enough."
S: "How many matrix-vector multiplies will I need to get a reasonably accurate
  answer?"
P: "Can you say anything about the range of eigenvalues, say the condition number
  kappa(A) = lambda_max / lambda_min?"
S: "Yes, kappa(A) is about n^(2/3) too."
P: "You could use the conjugate gradient method, which will need about sqrt(kappa(A))
  iterations, i.e. about n^(1/3). With at most 7 nonzeros per row, each matrix-vector
  multiplication costs at most 14n flops, so altogether O(n^(1/3)*n) = O(n^(4/3))
  flops. Happy yet?"
S: "No."
P: "Tell me more."
S: "I actually know the largest and smallest eigenvalues; does that help?"
P: "You know a lot about your matrix. What problem are you really trying to solve?"
ASK&WAIT:
S: "I have a cube of metal, I know the temperature everywhere on the surface, and I
  want to know the temperature everywhere inside."
P: "Oh, you're solving the 3D Poisson equation, why didn't you say so!
  Your best choice is either a direct method using an FFT = Fast Fourier Transform,
  costing O(n log n) flops, or an iterative method called multigrid, costing O(n)
  flops. And O(n) flops is O(1) flops per component of the solution, so you won't do
  better."
S: "And where can I download the software?"
...

This illustrates an important theme of this course: exploiting the mathematical
structure of your problem to find the fastest solution. The Poisson equation is one
of the best studied examples, but the number of interesting mathematical structures
is bounded only by the imaginations of the people working in math, science,
engineering and other fields who come up with problems to solve, so we will only
explore some of the most widely used structures and algorithms in this course.
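To make the decision tree concrete, here is a minimal Matlab sketch (not part of the
original notes). The model problem is a small 3D Poisson matrix like the one in the
story; the grid size k, the random right-hand side, and the pcg tolerance and
iteration limit are arbitrary choices for illustration, and the dense solves in
steps (1) and (2) are included only to contrast with the structured ones.

    % Minimal sketch of the decision tree above (illustrative only).
    % The model problem is the 3D Poisson matrix from the story: sparse,
    % symmetric positive definite, at most 7 nonzeros per row, bandwidth ~ n^(2/3).
    k = 15; n = k^3;                       % n-by-n system on a k-by-k-by-k grid
    I = speye(k); e = ones(k,1);
    T = spdiags([-e 2*e -e], -1:1, k, k);  % 1D second-difference matrix
    A = kron(kron(T,I),I) + kron(kron(I,T),I) + kron(kron(I,I),T);
    b = rand(n,1);

    % (1) Ignore all structure: dense Gaussian elimination, ~(2/3)*n^3 flops
    x1 = full(A) \ b;

    % (2) Use symmetry and positive definiteness: dense Cholesky, ~(1/3)*n^3 flops;
    %     the factor R can be reused for each new right-hand side b
    R  = chol(full(A));                    % A = R'*R, R upper triangular
    x2 = R \ (R' \ b);                     % two triangular solves per b

    % (3) Use sparsity: sparse backslash factors A without ever storing it densely
    x3 = A \ b;

    % (4) Iterate: conjugate gradients needs ~sqrt(kappa(A)) matrix-vector products,
    %     each touching only the ~7n nonzeros of A
    [x4, flag, relres, iters] = pcg(A, b, 1e-8, 200);
    fprintf('pcg took %d iterations; max difference between solvers = %g\n', ...
            iters, max(abs(x1 - x4)));

On this example the iterative solve touches only the roughly 7n nonzeros of A per
iteration, which is the point of the story.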
The story above suggests that counting floating point operations is the right metric
for choosing the fastest algorithm. In fact other metrics may be much more important.
Here are some of them:

(1) Fewest keystrokes: e.g. "A\b" to solve Ax=b. More generally, the metric is
    finding an existing reasonable implementation with as little human effort as
    possible. We will try to give pointers to the best available implementations,
    usually in libraries. There are lots of pointers on the class web page (e.g.
    netlib, GAMS) (demo). A\b invokes the LAPACK library, which is also the basis of
    the libraries used by most computer vendors, and which has been developed in a
    collaboration between Berkeley and other universities over a period of years.

(2) Fewest flops = floating point operations.
    ASK&WAIT: How many operations does it take to multiply two n-by-n matrices?
      Classical: 2*n^3
      Strassen (1969): O(n^(log2 7)) ~ O(n^2.81) - sometimes practical
        (see the sketch below)
      Coppersmith/Winograd (1987): O(n^2.376) - not practical
      Umans/Cohn: O(n^2.376), maybe O(n^2)? reduced to a group theory problem
        (FOCS 2003) - not yet practical
      Williams (2013): O(n^2.3728642)
      Le Gall (2014): O(n^2.3728639)
    Demmel, Dumitriu, Holtz (2008): all the other standard linear algebra problems
    (solve Ax=b, eigenvalues, etc.) have algorithms that can run as fast as matmul
    (and are backward stable); the ideas behind some of these algorithms are
    practical and will be discussed.
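To show where the O(n^(log2 7)) count comes from, here is a minimal recursive Matlab
sketch of Strassen's algorithm (not from the notes; the function name strassen and
the cutoff parameter are mine). It assumes n is a power of 2 and is not tuned for
performance; save it as strassen.m.

    % Strassen's matrix multiply: 7 half-size multiplies per level instead of 8.
    % Assumes A and B are n-by-n with n a power of 2. Illustrative sketch only.
    function C = strassen(A, B, cutoff)
        n = size(A,1);
        if n <= cutoff
            C = A*B;                      % fall back to the classical O(n^3) multiply
            return
        end
        m = n/2;
        i1 = 1:m; i2 = m+1:n;             % index the four n/2-by-n/2 blocks
        A11 = A(i1,i1); A12 = A(i1,i2); A21 = A(i2,i1); A22 = A(i2,i2);
        B11 = B(i1,i1); B12 = B(i1,i2); B21 = B(i2,i1); B22 = B(i2,i2);
        % 7 recursive multiplies
        M1 = strassen(A11+A22, B11+B22, cutoff);
        M2 = strassen(A21+A22, B11,     cutoff);
        M3 = strassen(A11,     B12-B22, cutoff);
        M4 = strassen(A22,     B21-B11, cutoff);
        M5 = strassen(A11+A12, B22,     cutoff);
        M6 = strassen(A21-A11, B11+B12, cutoff);
        M7 = strassen(A12-A22, B21+B22, cutoff);
        C  = [M1+M4-M5+M7, M3+M5;
              M2+M4,       M1-M2+M3+M6];
    end

For example, with A = randn(256) and B = randn(256), the quantity
norm(strassen(A,B,32) - A*B,'fro')/norm(A*B,'fro') should be near machine precision.
Each level of recursion replaces 8 half-size multiplications by 7, which is where the
exponent log2(7) ~ 2.81 comes from.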
But counting flops is not the only important metric in today's and future world, for
several reasons:

(1) ASK&WAIT: Who knows Moore's Law? Until about 2007, computers doubled in speed
    every 18 months with no code changes. This has ended; now the only way computers
    can run faster is by having multiple processors, so all code that needs to run
    faster (not just linear algebra!) has to change to run in parallel. This is
    causing upheaval in the computer industry, and generating lots of research and
    business opportunities. We will discuss parallel versions of some algorithms, a
    number of which differ numerically from the best sequential algorithms, in terms
    of error properties, how the answer is represented, etc. (This is discussed in
    more detail in CS267, taught in the Spring.)

(2) ASK&WAIT: What is the most expensive operation in a computer? Is it doing
    arithmetic? No: it is moving data, say between DRAM and cache, or between
    processors over a network. You can only do arithmetic on data stored in cache,
    not in DRAM, and on data stored on the same parallel processor, not a different
    one (draw pictures of basic architectures). It can cost 10x, 100x, or 1000x more
    to move a word than to do an add/subtract/multiply.

    Ex: Consider adding two n-by-n matrices, C=A+B. The cost is n^2 reads of A
    (moving each A(i,j) from main memory to the CPU, i.e. the chip that does the
    addition), n^2 reads of B, n^2 additions, and n^2 writes of C. The reads and
    writes cost O(100) times as much as the additions.

    Ex: An nVIDIA GPU (circa 2008) attached to a CPU: it cost 4 microseconds just to
    call a subroutine, in which time the GPU could have done 1e7 flops, since it ran
    at about 300 GFlops/s.

    Technology trends are making this worse: arithmetic is getting faster at about
    60%/year, but the speed of getting data from DRAM is improving at only about
    23%/year.

    Consequence: two different algorithms for the same problem, even if they do the
    same number of arithmetic operations, may differ vastly in speed, because they do
    different numbers of memory moves. Ex: the difference between matmul written in
    the naive way and an optimized matmul is easily 40x, and the same holds for other
    operations (this is again a topic of CS267); see the sketch after this list.

    We have recently discovered new algorithms for most of the problems discussed in
    this class that provably minimize data movement and are much faster than the
    conventional algorithms. We have also gotten grants to replace the standard
    libraries (called LAPACK and ScaLAPACK) used by essentially all companies
    (including Matlab!). So you (and many others) will be using these new algorithms
    in the next few years (whether you know it or not!). This is ongoing work with
    lots of open problems and possible class projects.

(3) ASK&WAIT: So far we have been talking about minimizing the time to solve a
    problem. Is there any other metric besides time that matters? Yes: energy. It
    turns out that a major obstacle to Moore's Law continuing as it has in the past
    is that it would take too much energy: the chips would get so hot they would melt
    if we tried to build them the same way as before. And whether you are concerned
    about the battery in your laptop dying, or the roughly $1M per megawatt per year
    it costs to run your datacenter or supercomputer, people are looking for ways to
    save energy. ASK&WAIT: Which operations performed by a computer cost the most
    energy? Again, moving data can cost orders of magnitude more energy per operation
    than arithmetic, so algorithms that minimize communication can also minimize
    energy.
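As a rough illustration of the gap mentioned above, the sketch below (not from the
notes; the size n is arbitrary) times a naive triple-loop matmul against the built-in
A*B, which calls an optimized BLAS. Caveat: in Matlab the loop also pays interpreter
overhead, so the measured ratio mixes that with the data-movement effect and need not
match the 40x quoted above for compiled code.

    % Same 2*n^3 flops, very different speeds: the naive loop streams data from
    % memory inefficiently, while the optimized BLAS behind A*B reuses data in cache.
    n = 256;
    A = randn(n); B = randn(n);

    tic;
    C1 = zeros(n);
    for i = 1:n
        for j = 1:n
            for kk = 1:n
                C1(i,j) = C1(i,j) + A(i,kk)*B(kk,j);
            end
        end
    end
    t_naive = toc;

    tic; C2 = A*B; t_blas = toc;

    fprintf('naive: %.3f s, built-in: %.5f s, ratio ~ %.0fx, err = %g\n', ...
            t_naive, t_blas, t_naive/t_blas, norm(C1-C2,'fro'));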
P: "In that case, you might want to use the standard algorithm, which is 'backward stable', which means that it is actually equivalent (in the usual 16-digit double precision arithmetic) to perturbing the input data in the 16-th digit, and then computing the exact answer: output of the algorithm = f_alg(x) = f_exact(x + dx) where x is your data, x+dx is the perturbed data, and f_exact() is the mathematically exact solution that the algorithm f_alg() approximates. This makes almost all users happy, getting exactly the right answer to almost the same problem they wanted to solve, since they usually only know the problem approximately anyway. And there are low cost ways to estimate an error bound, i.e. a bound on the error of the form error = f_alg(x) - f_exact(x) = f_exact(x+dx) - f_exact(x) ~ f_exact'(x)*dx using calculus to estimate f_exact'(x), which may be large if the input matrix is close to low-rank. The size of f_exact'(x), appropriately scaled, is called the condition number of the problem x. SFI: "Actually, I work on Wall Street, and probably getting an answer that probably has a not-too-large error is probably good enough." P: "In that case you should consider a randomized algorithm that has just those properties, and is much cheaper than the standard algorithm, costing as little as O(nnz) where nnz = number of nonzeros in your matrix, vs O(m*n^2) where m = #rows and n = #columns." There is one more kind of "accuracy" demand that a user might make: to get the bit-wise identical answer every time they run the program (or call the same subroutine with the same input). This seems like a reasonable expectation, and is expected by many users for debugging (to be able to reproduce errors) and for correctness. But because of parallelism, this is no longer guaranteed, or even likely to occur. We will return to this when we talk about floating point arithmetic. This illustrates another theme of the semester, that there is a tradeoff between accuracy and speed, that "backward stability", getting the right answer for almost the right problem, is a property of many algorithms we will discuss, and that we can estimate condition numbers to get error bounds if desired. To summarize the syllabus of the course: Ax=b, least squares, eigenproblems, Singular Value Decomposition (SVD) and variations Direct methods (aka matrix factorizations: LU, Cholesky, QR, Schur form etc.) Iterative methods (eg Jacobi, Gauss-Seidel, Conjugate Gradients, Multigrid, etc.) Deterministic and randomized algorithms Shared themes: exploiting mathematical structure (eg symmetry, sparsity, ...) numerical stability / condition numbers - how accurate is my answer? efficiency (minimizing flops and communication = data movement) finding good existing software