Notes for Ma221 Lecture 1, Aug 27 2009

Greetings
  class URL: www.cs.berkeley.edu/~demmel/ma221
  Prereqs: good linear algebra, programming experience in Matlab/{Fortran,C,C++}
  First homework on web page
  Enrollment: will get a larger classroom next week if necessary
  Status of books in bookstore
  Fill out survey - what we cover may depend on what you say you are interested in
  Office hours TBD
  Reader - recruiting now!

Syllabus: Ax=b, least squares, eigenproblems, singular value decomposition, variations
  Standard techniques: matrix factorizations
    Ax=b: how do you solve a triangular linear system? (by substitution)
      A = P*L*U - Gaussian elimination; reduces Ax=b to triangular solves
    least squares: A = Q*R, the "QR decomposition", Q orthogonal, R upper triangular
    eigenproblems: what are the eigenvalues of a triangular matrix? (its diagonal)
      A = Q*T*Q', Q unitary, T triangular - Schur factorization
    singular value decomposition: A = U*Sigma*V'
  dense and sparse versions, "partial factorizations" for sparse matrices
  iterative methods
    those based only on matrix-vector multiplication
      (Krylov subspace methods, eg the conjugate gradient method)
    those that depend more on matrix structure

Importance of exploiting structure
  typical dialogue with someone in my office hours:
    What is the best way to solve an n-by-n linear system Ax=b?
      Gaussian elimination: (2/3)n^3 flops
    if A is symmetric and positive definite, then:
      Cholesky: (1/3)n^3 flops
    if A is also banded, with bandwidth sqrt(n), then:
      band Cholesky: O(n^2) flops
    if A has at most 5 nonzeros per row, and condition number = O(n), then:
      conjugate gradients: O(n^(3/2)) flops (O(sqrt(n)) iterations, each costing O(n))
    if the problem is finding the equilibrium temperature distribution in a
      unit square of uniform material, given the temperature at the boundary
      (the Poisson equation), then:
      FFT: O(n log n), or multigrid: O(n)
      since the size of the answer is n, this is optimal
  this is one example of a path in a "decision tree": each node asks a question
  about the structure of the problem, and the more you know, the more potential
  for speed or accuracy you have; we'll see other decision trees later
  exploiting structure is a major source of research activity

Understanding sources of error - an example from civil engineering
  Consider computing the displacement of a steel structure
  ("will this building fall down in an earthquake?"):
  (1) write down the governing equations (eg a PDE) (CE231)
      error sources: continuum mechanics averages over the crystal structure,
      and we don't know the material properties to many digits
  (2) discretize the PDE, i.e. replace it by a linear system, using, say, FEM
      (not a topic of this course) - error from discretization (ME280A)
  (3) compute and store the coefficients of the matrix - error from computation
      and storage (eg what happens to 1/3?)
  (4) run an iterative algorithm - error from stopping the iteration after a
      finite number of steps
  (5) round off each arithmetic operation - error from roundoff
  The last 3 error sources are topics of this course, but the error in the
  overall problem may well be dominated by the first 3 (this is a goal!).
  Understanding the errors from the first 3 sources is a course by itself...

Perturbation theory and condition numbers
  a way to measure the contribution to the error from the input, including
  uncertainty from the problem formulation
  Suppose the problem is to evaluate a scalar function y = f(x), but we only
  know x + dx, where dx is "small compared to x". Then
    absolute error = f(x+dx) - f(x) ~ f'(x)*dx
    relative error = |f(x+dx) - f(x)| / |f(x)| ~ |f'(x)*x/f(x)| * |dx/x|
                   = condition number * relative change in input
  ASK: why relative error instead of absolute error? to avoid units
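  A quick Matlab check of this formula (a sketch; the choices f = exp and
  x = 10 are arbitrary - any smooth f and nonzero x would do):

    f  = @(x) exp(x);              % example function (arbitrary choice)
    fp = @(x) exp(x);              % its derivative f'
    x  = 10;
    dx = 1e-8 * x;                 % small relative perturbation of the input
    relerr  = abs(f(x+dx) - f(x)) / abs(f(x));
    condnum = abs(fp(x)*x/f(x));   % condition number; equals |x| = 10 for exp
    relin   = abs(dx/x);           % relative change in input = 1e-8
    [relerr, condnum*relin]        % both ~1e-7, as the formula predicts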
Next suppose we have an algorithm alg(x) ~ f(x) that makes approximations,
makes roundoff errors, or takes only a finite number of iterations.
What do we do?

Def: alg(x) is "backward stable" if alg(x) = f(x+dx) with dx/x small,
  i.e. it gives the right answer for a slightly wrong problem.
  Then alg(x) - f(x) = f(x+dx) - f(x), and the above analysis applies.
  How do we interpret this when x and y are vectors or matrices?
    Solving A*x=b means evaluating f({A,b}) = inv(A)*b, so we need to show
    that our algorithms give inv(A+dA)*(b+db) with dA and db small.
    We will need to develop more tools for this later.

Usually there are many possible algorithms to choose from. How do we pick one,
and/or how do we implement it? (We need to do this even after deciding, say,
that Gaussian elimination or CG is "best".) Criteria:

(1) fewest keystrokes: eg "A\b"
    we will try to give pointers to the best available implementations,
    usually in libraries; lots of pointers on the class webpage (netlib, GAMS)
    (demo)

(2) fastest
    traditional approach: count flops
    ASK: how many operations does it take to multiply two n-by-n matrices?
      classical: 2*n^3
      Strassen (1969): O(n^log2(7)) ~ O(n^2.81) - sometimes practical
      Coppersmith/Winograd (1987): O(n^2.376) - not practical
      Cohn/Umans (FOCS 2003): O(n^2.376), maybe O(n^2)? reduced to a problem
        in group theory - not yet practical
      Demmel/Dumitriu/Holtz (2008): all the other problems can be solved as
        fast as matmul, and backward stably - the ideas behind some of these
        algorithms are practical, and will be discussed
    but counting flops is not really the right measure in today's world or
    the future one, for two reasons.
    First reason:
      ASK: what is the most expensive operation in a computer?
      moving data, between DRAM and cache or between processors over a
      network: it can cost 100x or 1000x more to move a word than to do an
      add/sub/mul
      Ex: with an nVIDIA GPU attached to a CPU, it takes ~4 microseconds to
      call a subroutine, in which time you could have done ~1e6 flops, since
      the GPU runs at 300 GFlops/s
      Consequence: two different implementations of the same algorithm
      (matmul, GE, ...) that do the same #flops may differ vastly in speed,
      because they do different numbers of memory moves.
      Ex: the difference between matmul written in the naive way and matmul
      optimized for memory traffic is easily 10x; the same holds for other
      operations. We will discuss a general theory of, and recent results on,
      how to minimize the data movement in algorithms (recent work, lots of
      projects); see the Matlab experiment below.
    Second, related reason:
      ASK: who knows Moore's Law?
      Until 2007, computers doubled in speed every 18 months with no code
      changes. This has ended; instead, processor (core) counts are doubling
      every 2 years, so all code that needs to run faster has to change to
      run in parallel. This is causing upheaval in the computer industry;
      the ParLab was established at UC Berkeley to research this.
      We will discuss parallel versions of many algorithms.
      Note that parallelism and data movement are coupled, because moving
      data between processors is expensive.

(3) accuracy
    we can't always guarantee a tiny error at reasonable cost, so we
      try to guarantee a tiny backward error
      ("the right answer for a slightly different problem"), and
      provide cheap, estimated condition numbers for rough error bounds
    what if you really need a proof that the error is small? it depends on
    what you mean by "proof"; for high reliability:
      run a variant of Newton's method in high precision until it converges
      use interval arithmetic

How to engineer good numerical software?
  Easiest: use existing software if appropriate (Matlab)
    finding existing software on the web (NETLIB, GAMS) (see class homepage)
  Harder: writing your own - is it worth it?
    show results of tuning matmul from CS267:
    http://www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
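    A quick Matlab experiment along these lines (a sketch; n = 500 is an
    arbitrary size, and in Matlab interpreter overhead exaggerates the gap,
    but a compiled naive loop also loses badly, for the data-movement
    reasons above):

      n = 500;
      A = randn(n); B = randn(n);

      tic; C1 = A*B; t_tuned = toc;  % calls tuned BLAS, blocked for cache

      tic;
      C2 = zeros(n);
      for i = 1:n                    % naive triple loop: same 2*n^3 flops,
        for j = 1:n                  % but far more memory traffic per flop
          s = 0;
          for k = 1:n
            s = s + A(i,k)*B(k,j);
          end
          C2(i,j) = s;
        end
      end
      t_naive = toc;

      t_naive / t_tuned              % easily 10x-100x or more in Matlab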
  Eg: suppose you type "x = A\b" in Matlab to solve Ax=b (no programming
  required), and it takes a minute to solve. Suppose A has special structure
  so that if you spent 1 week programming, you could solve it in 1 second.
  Should you spend that week? (trick question)
  Answer: if you only need to solve it once, no;
    if you or others need to solve it more than 7 days / 59 secs ~ 1e4 times,
    yes (each solve saves 59 seconds, and a week is ~6e5 seconds)
  Current research: generating code automatically
    motivation: tuning is hard to do, and the best implementations keep
    changing as computers change; see bebop.cs.berkeley.edu

Example: polynomial evaluation, and polynomial zero finding
  EX: how bisection works
  try it on (x-2)(x-3)(x-4) = x^3 - 9*x^2 + 26*x - 24:
    rts = [2,3,4]
    coeff = poly(rts)
    help bisect                  % bisect is on the class web page
    bisect(coeff,2.6,3.1,1e-12)
    bisect(coeff,2.2,3.2,1e-12)
    ... no surprises: we get small intervals around the root at 3
  now try it on (x-2)^13:
    rts = 2*ones(1,13)
    coeff = poly(rts)
    bisect(coeff,1.7,2.4,1e-12)
    bisect(coeff,1.7,2.2,1e-12)
    bisect(coeff,1.9,2.2,1e-12)
    ... huh?
  apply Horner's rule to (x-2)^13 (Matlab demo) - what is the real graph?
    x = 1.7:1e-4:2.3;
    y = polyval(coeff,x);        % evaluated from the coefficients
    yacc = (x-2).^13;            % evaluated accurately, from the factored form
    plot(x,y,'k.',x,yacc,'r')
    axis([1.8,2.2,-1e-8,1e-8])
    ... can we explain this?
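  For reference, polyval evaluates a polynomial from its coefficients by
  Horner's rule; a minimal sketch (with coeff ordered highest degree first,
  as poly returns it) is:

    function y = horner(coeff, x)
    % Evaluate the polynomial with coefficients coeff at x by Horner's rule.
    y = coeff(1);
    for k = 2:length(coeff)
      y = y.*x + coeff(k);       % one multiply and one add per coefficient
    end

  Each of these operations is rounded off, so the computed value is the true
  value plus noise roughly the size of machine epsilon times the largest
  intermediate quantities; near x = 2 the true value of (x-2)^13 is far
  smaller than that noise, which is what the plot shows.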