Notes for Ma221 Lecture 1, Aug 26, 2010

Greetings!
   Class URL: www.cs.berkeley.edu/~demmel/ma221
   Prereqs: good linear algebra; programming experience in Matlab/{Fortran,C,C++};
      numerical sophistication at the level of Ma128ab is desirable
   First homework is on the web page
   Enrollment: no problem fitting everyone so far
   No final; there will be a project instead
   Status of books in bookstore - should be there
   Fill out the survey - what we cover may depend on what you say you are interested in
   Office hours TBD
   Reader - recruiting now!

Syllabus: Ax=b, least squares, eigenproblems, the singular value decomposition, and variations

   Standard techniques: for some matrices these problems are easy -
      diagonal, triangular, permutations, orthogonal, ...
      so we use matrix factorizations: A written as a product of such "simpler" matrices.
      (How would you solve a triangular linear system?)
   How do we solve a linear system?
      A = P*L*U = permutation * lower_triangular * upper_triangular,
      aka the "LU" factorization, aka Gaussian elimination.
   How do we solve a least squares problem?
      A = Q*R = orthogonal * upper_triangular, aka the "QR" decomposition.
      A classical algorithm for this is Gram-Schmidt; however it can give awful
      errors because of roundoff, so we will need a different algorithm.
   Eigenproblems: what are the eigenvalues of a triangular matrix?
      A classical matrix factorization is the Jordan canonical form A = S*J*inv(S),
      where J is in Jordan form and the columns of S are the eigenvectors;
      this also turns out to be awful numerically.
      Instead we compute the Schur factorization A = Q*T*Q', with Q unitary and T triangular.
   Singular value decomposition (for rank-deficient least squares, etc.): A = U*Sigma*V'

   The same factorizations can be used for less obvious purposes, e.g.:
      given x' = A*x + B*u, y = C*x, can I use feedback control (u = F*y for some
      fixed matrix F) to control x, e.g. keep x bounded?

   Iterative methods - used when the matrix is so large that factorizations are too expensive.
      The simplest nontrivial operation you can do with a matrix is multiply it by a
      vector; some iterative methods need only this (Krylov subspace methods, e.g. the
      conjugate gradient method); others exploit more of the matrix structure and can
      be much faster (multigrid).

Understanding sources of error - an example from civil engineering
   Consider the question of computing the displacement of a steel structure
   ("will this building fall down in an earthquake?"):
   (1) Write down the governing equations (e.g. a PDE) (CE231).
       Error sources: continuum mechanics averages over the crystal structure, which we
       don't know, and we also don't know the material properties to many digits.
   (2) Discretize the PDE, i.e. replace it by a linear system using, say, FEM
       (not a topic of this course) - error from discretization (ME280A).
   (3) Compute and store the coefficients of the matrix - error from computation and
       storage (e.g. what happens to 1/3? see the short demo below).
   (4) Run an iterative algorithm - error from stopping the iteration after a finite
       number of steps.
   (5) Each arithmetic operation is rounded off - error from roundoff.
   The last 3 error sources are topics of this course, but the overall error may well be
   dominated by the first 3 (this is a goal for us: that practitioners of civil and
   mechanical engineering can concentrate on the error introduced by their models, and be
   confident that the other errors are less significant). Understanding the errors from
   the first 2 sources is the subject of separate courses.
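   To make error sources (3) and (5) concrete, here is a minimal Matlab sketch (not one
   of the in-class demos): a coefficient like 1/3 cannot be stored exactly, and every
   operation is rounded. The digits printed assume IEEE double precision arithmetic.
      % Error sources (3) and (5): storing coefficients, and rounding every operation.
      fprintf('1/3 is stored as       %.20f\n', 1/3)           % not exactly one third
      fprintf('0.1 + 0.2 is stored as %.20f\n', 0.1 + 0.2)     % not exactly 0.3
      fprintf('(0.1+0.2) == 0.3 ?     %d\n', (0.1+0.2) == 0.3) % prints 0 (false)
      fprintf('machine epsilon      = %g\n', eps)  % 2^(-52) ~ 2.2e-16, the gap between
                                                   % 1 and the next larger double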
To understand these 3 error sources, we need to discuss perturbation theory, condition
numbers, and numerical stability.

Condition numbers: a way to measure the contribution to the error from the input,
including uncertainty from the problem formulation.
   Suppose the problem is to evaluate a scalar function y = f(x), but we only know x + dx,
   where dx is "small compared to x". Then
      absolute error = f(x+dx) - f(x) ~ f'(x)*dx
      relative error = |f(x+dx) - f(x)| / |f(x)| ~ |f'(x)*x/f(x)| * |dx/x|
                     = condition number * relative change in input
   ASK: why relative error instead of absolute error?
      To avoid units (e.g. f'(x) depends on the units of f(x), but f'(x)/f(x) does not),
      and because dx/x and dy/y can easily be interpreted as giving the number of correct
      leading digits.

Next suppose that we have an algorithm alg(x) ~ f(x) which makes approximations, makes
roundoff errors, or takes only a finite number of iterations. What do we do?
   Def: alg(x) is "backward stable" if alg(x) = f(x+dx) with dx/x small,
      i.e. it gives the right answer for a slightly wrong problem.
   Then alg(x) - f(x) = f(x+dx) - f(x), and the analysis above applies.
   Consequence: if the condition number is large ("ill-conditioned"), then we either have
   to settle for a large error bound, or use a more expensive algorithm.

Here is a geometric interpretation of the condition number, in particular when it is very
large. How can f'(x)*x/f(x) be very large? The typical way is for f(x) to be tiny, i.e.
for x to be close to a root r, where f(r) = 0. In this case we can use Newton's method to
estimate r: r ~ x - f(x)/f'(x), i.e. x - r ~ f(x)/f'(x). Dividing by x and rearranging gives
      f'(x)*x/f(x) ~ 1/((x-r)/x)
                   = 1/(relative distance from x to r)
                   = 1/(relative distance from x to the nearest problem whose condition
                        number is infinite)
So we can think of ill-conditioned problems geometrically as lying in small neighborhoods
of the set of "infinitely ill-conditioned problems".

How do we interpret all this when x and y are vectors or matrices?
   Ex: solving A*x = b means f({A,b}) = inv(A)*b, so
   (1) we need to differentiate this to get a condition number,
   (2) we need to show that our algorithms give inv(A+dA)*(b+db), i.e. are backward
       stable, and
   (3) we need to show that the condition number of A = 1/(distance from A to the nearest
       "infinitely ill-conditioned problem") = 1/(distance from A to the nearest singular
       matrix).
   In this case the singular value decomposition will tell us all this information.
   We will need to develop more tools later to do this, such as norms (to generalize the
   notion of "absolute value" to vectors and matrices).

Importance of exploiting mathematical structure - a typical dialogue with someone in my
office hours:
   What is the best way to solve an n-by-n linear system Ax=b?
      Gaussian elimination: (2/3)n^3 flops.
   If A is symmetric and positive definite, then:
      Cholesky: (1/3)n^3.
   If A is banded, with bandwidth sqrt(n), then:
      band Cholesky: O(n^2).
   If A has at most 5 nonzeros per row and condition number O(n), then:
      conjugate gradients: O(n^(3/2)).
   If the problem is finding the equilibrium temperature distribution in a unit square of
   uniform material, given the temperature at the boundary (the Poisson equation), then:
      FFT: O(n log n), or multigrid: O(n). Since the size of the answer is n, this is optimal.
This is one example of a path in a "decision tree": each node asks a question about the
structure of the problem, and the more you know, the more potential for speed or accuracy
you have. We'll see other decision trees later. Exploiting structure is a major source of
research activity (see the small demo below).
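   Here is a minimal Matlab sketch (not one of the in-class demos) of how much exploiting
   structure buys, using the 2-D Poisson matrix from the decision tree above; Matlab's
   gallery('poisson',k) builds it as a sparse matrix. Timings are machine-dependent, and
   this only contrasts a dense solve with Matlab's sparse solver rather than walking the
   whole tree down to FFT or multigrid.
      k = 60;                          % k-by-k grid, so n = k^2 = 3600 unknowns
      A = gallery('poisson', k);       % sparse SPD matrix, bandwidth sqrt(n) = k
      n = size(A,1);
      b = randn(n,1);
      Adense = full(A);                % throw the structure away
      tic; x1 = Adense \ b; t_dense  = toc;   % dense factorization: O(n^3) work
      tic; x2 = A      \ b; t_sparse = toc;   % sparse solver exploits the band/sparsity
      fprintf('dense: %.3f s   sparse: %.3f s   relative difference: %g\n', ...
              t_dense, t_sparse, norm(x1-x2)/norm(x1))
   The sparse solve should be much faster even though both compute the same x; the FFT
   and multigrid branches of the tree would be faster still.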
Usually there are many possible variations on these algorithms, and many implementations;
how do we choose the best one? Criteria:

(1) Fewest keystrokes: e.g. "A\b".
    We will try to give pointers to the best available implementations, usually in
    libraries; there are lots of pointers on the class web page (netlib, GAMS).
    (demo) "A\b" invokes the LAPACK library, which is also used as the basis of the
    libraries used by most computer vendors, and which has been developed in a
    collaboration by Berkeley and other universities over a period of years.

(2) Fastest. The traditional approach is to count flops.
    ASK: how many operations does it take to multiply two n-by-n matrices?
       Classical: 2*n^3
       Strassen (1969): O(n^(log2 7)) ~ O(n^2.81) - sometimes practical
       Coppersmith/Winograd (1987): O(n^2.376) - not practical
       Cohn/Umans: O(n^2.376), maybe O(n^2)? reduced to a group theory problem
          (FOCS 2003) - not yet practical
       Demmel/Dumitriu/Holtz (2008): all the other problems can be solved as fast as
          matmul (and backward stably) - the ideas behind some of these algorithms are
          practical, and will be discussed.
    But counting flops is not really the right measure in today's and future world, for
    two reasons:
    (1) ASK: who knows Moore's Law? Until 2007, computers doubled in speed every 18
        months with no code changes. This has ended; now the only way computers can run
        faster is by having multiple processors, so all code that needs to run faster
        (not just linear algebra!) has to change to run in parallel. This is causing an
        upheaval in the computer industry, and as a result the ParLab was established at
        UC Berkeley to research it. We will discuss parallel versions of many algorithms.
    (2) ASK: what is the most expensive operation in a computer? Moving data - between
        DRAM and cache, or between processors over a network - can cost 100x to 1000x
        more than an add, subtract, or multiply.
        Ex: Consider adding two n-by-n matrices, C = A+B. The cost is n^2 reads of A
        (moving each A(i,j) from main memory to the CPU, i.e. the chip that does the
        addition), n^2 reads of B, n^2 additions, and n^2 writes of C. The reads and
        writes cost O(100) times as much as the additions.
        Ex: On an nVIDIA GPU (2 years ago) attached to a CPU, it cost 4 microseconds just
        to call a subroutine, in which time you could have done about 1e6 flops, since
        the GPU runs at 300 GFlops/s.
        Technology trends are making this worse: arithmetic is getting faster by about
        60%/year, but the speed of getting data may improve by only about 15%/year.
    Consequence: two different algorithms for the same problem, even if they perform the
    same number of arithmetic operations, may differ vastly in speed because they do
    different numbers of memory moves.
       Ex: the difference between matmul written in the naive way and an optimized
       version is easily 40x, and the same holds for other operations (see the demo below).
    We have discovered new algorithms for many of the matrix factorizations discussed in
    this class that provably minimize data movement and are much faster than the
    conventional algorithms. We have also gotten large grants to replace the standard
    libraries (called LAPACK and ScaLAPACK) used by essentially all companies (including
    Matlab!). So you (and many others) will be using these new algorithms in the next few
    years (whether you know it or not!). Note that parallelism and data movement are
    coupled, because moving data between processors is expensive.
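    Here is a minimal Matlab sketch (not one of the in-class demos) of the naive-vs-
    optimized matmul gap referred to above. In Matlab the triple loop also pays
    interpreter overhead, so the measured ratio mixes that with the cache effects
    discussed above; a fairer comparison is naive vs. blocked C code, as in the CS267
    matmul assignment linked later in these notes. Timings are machine-dependent.
      n = 300;
      A = randn(n);  B = randn(n);
      tic; C1 = A*B; t_opt = toc;      % calls an optimized BLAS, blocked for the caches
      tic;
      C2 = zeros(n);
      for i = 1:n                      % naive triple-loop matmul: same 2*n^3 flops
        for j = 1:n
          for k = 1:n
            C2(i,j) = C2(i,j) + A(i,k)*B(k,j);
          end
        end
      end
      t_naive = toc;
      fprintf('naive/optimized time ratio: %.0f   max difference: %g\n', ...
              t_naive/t_opt, max(max(abs(C1 - C2))))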
(3) Accuracy. We can't always guarantee a tiny error at reasonable cost if the problem is
    ill-conditioned; instead we try to guarantee a tiny backward error: "the right answer
    for a slightly different problem". If you only know your input to 12 digits, and the
    backward error only introduces uncertainty in the 15th digit, then the answer is as
    good as you can expect. We also provide cheap, estimated condition numbers for rough
    error bounds.
    What if you really need a proof that the error is small? It depends on what you mean
    by "proof":
       for high reliability, run a variant of Newton's method in high precision until it
          converges;
       or use interval arithmetic.

How to engineer good numerical software?
   Easiest: use existing software if appropriate (Matlab).
      Finding existing software on the web: NETLIB, GAMS (see the class home page).
   Harder: writing your own - is it worth it?
      Show the results of tuning matmul from CS267:
         http://www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
      E.g.: suppose you type "x=A\b" in Matlab to solve Ax=b (no programming required)
      and it takes a minute to solve. Suppose that A has special structure so that if you
      spent 1 week programming, you could solve it in 1 second. Should you spend that
      week? (trick question)
      Answer: if you only need to solve it once, no; if you or others need to solve it
      more than 7 days / 59 seconds ~ 1e4 times, yes.
   Current research: generating code automatically.
      Motivation: it is hard to do by hand, and implementations keep changing as
      computers change; see bebop.cs.berkeley.edu.

Example: polynomial evaluation and polynomial zero finding

   EX: how bisection works (a sketch of a bisect routine appears at the end of these
   notes). Try it on (x-2)(x-3)(x-4) = x^3 - 9*x^2 + 26*x - 24:
      rts = [2,3,4]
      coeff = poly(rts)
      help bisect                % bisect.m is on the class web page
      bisect(coeff,2.6,3.1,1e-12)
      bisect(coeff,2.2,3.2,1e-12)
      ... no surprises; we get small intervals around 3.
   Now try it on (x-2)^13:
      rts = 2*ones(1,13)
      coeff = poly(rts)
      bisect(coeff,1.7,2.4,1e-12)
      bisect(coeff,1.7,2.2,1e-12)
      bisect(coeff,1.9,2.2,1e-12)
      ... huh? A very different answer each time?
   Apply Horner's rule to (x-2)^13 (Matlab demo) - what is the real graph?
      x = 1.7:1e-4:2.3;
      y = polyval(coeff,x);      % evaluate from the coefficients (Horner's rule)
      yacc = (x-2).^13;          % evaluate from the factored form
      plot(x,y,'k.',x,yacc,'r')
      axis([1.8,2.2,-1e-8,1e-8])
      ... can we explain this?
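   For reference, here is a minimal sketch of a bisection zero finder in the spirit of
   the bisect routine used above; the actual bisect.m on the class web page may differ in
   its interface and details, so treat this version as an assumption. It evaluates the
   polynomial with polyval, i.e. Horner's rule, which is exactly where the puzzling
   behavior above enters.
      function [xlo, xhi] = bisect(coeff, xlo, xhi, tol)
      % Bisection for a zero of the polynomial with coefficients coeff (highest degree
      % first), assuming polyval(coeff,xlo) and polyval(coeff,xhi) have opposite signs.
      % Returns an interval of width <= tol containing a point where the COMPUTED
      % polynomial changes sign.
      flo = polyval(coeff, xlo);           % Horner's rule evaluation
      while xhi - xlo > tol
         xmid = (xlo + xhi)/2;
         fmid = polyval(coeff, xmid);
         if sign(fmid) == sign(flo)        % sign change lies in the right half
            xlo = xmid;  flo = fmid;
         else                              % sign change lies in the left half
            xhi = xmid;
         end
      end
   With a routine like this, different starting intervals can converge to different
   "roots" of (x-2)^13, because near x = 2 the rounded value of polyval(coeff,x) changes
   sign many times, which is what the plot demo above suggests.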