Notes for Ma221 Lecture 1, Aug 26, 2010

Greetings!
   class URL: www.cs.berkeley.edu/~demmel/ma221
   Prereqs: good linear algebra, 
            programming experience in Matlab/{Fortran,C,C++}
             numerical sophistication at the level of Ma128ab desirable
   First Homework on web page 
   Enrollment: no problem fitting everyone so far
   No final, will be project instead
   Status of books in bookstore - should be there
   Fill out survey - what we cover may depend on what you say
      you are interested in
   Office hours TBD
   Reader - recruiting now!

Syllabus:
   Ax=b, least squares, eigenproblems, Singular value decomposition
      variations

   Standard techniques
     for some matrices, these problems are easy:
       diagonal, triangular, permutations, orthogonal ...

     matrix factorizations - A written as product of such "simpler" matrices,
         how to solve a triangular linear system? (by forward or back substitution)
           for general A, factor A = PLU = perm * lower_tri * upper_tri,
             aka "LU" factorization or Gaussian elimination; then solving A*x=b
             reduces to a permutation and two triangular solves
             (see the Matlab sketch after this list)
        how to solve a least squares problem?
          A = QR  = orth * upper_tri, aka  "QR" decomposition, 
           A classical algorithm for this is Gram-Schmidt; however,
           it can give awful errors because of roundoff, so we
           will need a different algorithm.
         eigenproblems: what are the eigenvalues of a triangular matrix? (its diagonal entries)
          A classical matrix factorization is the Jordan Canonical Form
          A = S*J*inv(S), where J is in Jordan form, and the columns of
          S are the eigenvectors; this also turns out to be awful numerically.
          Instead we compute the Schur factorization:
          A = Q*T*Q', Q unitary, T triangular 
        Singular value decomposition (for rank-deficient least squares etc)
          A = U Sigma V'
        Same factorizations can be used for less obvious purposes, eg:
          given x' = A*x + B*u, y = C*x, can I use feedback control 
          (u = F*y for some fixed matrix F) to control x, e.g. keep x bounded?
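
      As a preview, here is a minimal Matlab sketch of how these factorizations
      are invoked (the matrix A and right-hand side b are just a random example):

         n = 5; A = randn(n); b = randn(n,1);   % a generic dense example

         [L,U,P] = lu(A);         % P*A = L*U (Gaussian elimination with pivoting)
         x = U \ (L \ (P*b));     % solve A*x = b by two triangular solves

         [Q,R] = qr(A);           % A = Q*R, Q orthogonal, R upper triangular
         xls = R \ (Q'*b);        % least squares solution (here A is square, so = A\b)

         [QS,T] = schur(A,'complex');   % A = QS*T*QS', T upper triangular
         eig_A = diag(T);               % eigenvalues sit on the diagonal of T

         [U2,S,V] = svd(A);       % A = U2*S*V', singular values on diag(S)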

     iterative methods - 
        when matrix so large that factorizations are too expensive
        simplest nontrivial operation you do with a matrix is multiply it
          by a vector:
         some iterative methods only need this:
              (Krylov subspace methods, eg the conjugate gradient method;
               see the sketch after this list)
         others exploit more of the matrix structure and can be much faster
              (multigrid)
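
      Here is a minimal Matlab sketch of the "matrix-vector products only" idea,
      using the built-in conjugate gradient solver pcg on a sparse model problem
      (the matrix, tolerance, and iteration limit are just illustrative choices):

         n = 50;                          % grid dimension (illustrative)
         A = gallery('poisson', n);       % sparse SPD matrix of size n^2-by-n^2
         b = ones(n^2, 1);
         matvec = @(v) A*v;               % all the Krylov method sees is this handle
         x = pcg(matvec, b, 1e-8, 500);   % conjugate gradients: tol 1e-8, <= 500 steps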
           
     understanding sources of error - example from Civil Engineering
        Consider question of computing displacement of steel structure 
        ("will this building fall down in an earthquake?")
          (1) write down governing equations (eg PDE) (CE231)
                error sources - continuum mechanics averages over crystal 
                 structure, which we don't know, and we also don't know 
                 the material properties to many digits
          (2) discretize PDE, i.e. replace by linear system, 
                using, say, FEM (not topic of this course)
              - error from discretization (ME280A)
          (3) compute and store coefficients of matrix
               - error from computation and storage (eg what happens to 1/3? see the sketch below)
          (4) run iterative algorithm
              - error from stopping iteration after finite # steps
          (5) each arithmetic operation rounded off
              - error from roundoff
         The last 3 error sources are topics of this course, but the overall
         error may well be dominated by the first 2 (this is a goal for us: that
         practitioners of civil and mechanical engineering can concentrate on the
         error introduced by their models, and be confident that other errors are
         less significant).
            Understanding the errors from the first 2 sources is the subject of
            separate courses.
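
         A tiny Matlab illustration of error source (3): 1/3 has no exact binary
         floating point representation, so it is rounded the moment it is stored.

            fprintf('%.20f\n', 1/3)   % prints 0.33333333333333331483..., not 0.333...
            (4/3 - 1)*3 - 1           % not exactly 0; a roundoff-sized number (~ -2.2e-16)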

     To understand these 3 error sources, we need to discuss
     perturbation theory, condition numbers, and numerical stability
         a way to measure the contribution to the error from uncertainty in the
           input, including uncertainty from the problem formulation
       suppose problem is to evaluate scalar function y = f(x)
          but we only know x + dx, and that dx is "small compared to x"
            absolute error = f(x+dx)-f(x) ~ f'(x)dx
            relative error = (f(x+dx)-f(x))/|f(x)| ~ |f'(x) x / f(x)| * |dx/x|
              = condition number * relative change in input
       ASK: why relative error instead of absolute error? avoid units
             (eg f'(x) depends on the units of f(x), but f'(x)/f(x) does not;
              and dx/x and dy/y can be easily interpreted as giving the
              number of correct leading digits)
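
        A small Matlab illustration of this formula (the particular function and
        perturbation size are arbitrary choices):

           f  = @(x) x.^3 - 9*x.^2 + 26*x - 24;        % the cubic (x-2)(x-3)(x-4)
           fp = @(x) 3*x.^2 - 18*x + 26;               % its derivative
           x  = 3.01;                    % close to the root at 3, so ill-conditioned
           dx = 1e-8 * x;                % relative input perturbation of 1e-8
           cond_fx = abs(fp(x)*x/f(x));                % condition number, here ~ 300
           rel_err = abs(f(x+dx) - f(x)) / abs(f(x));  % observed relative output error
           [cond_fx * abs(dx/x), rel_err]              % these agree to leading order
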
        Next suppose that we have an algorithm alg(x) ~ f(x)
          which makes approximations, commits roundoff errors, and
            takes only a finite number of iterations; what do we do?
       Def: alg(x) is "backward stable" if alg(x) = f(x+dx), dx/x small
            right answer for slightly wrong problem
       then alg(x)-f(x) = f(x+dx)-f(x), and above analysis applies
        consequence: if the condition number is large ("ill-conditioned"), then we
           either have to settle for a large error bound, or use a more expensive algorithm

       Here is a geometric interpretation of the condition number, in particular
       when it is very large. How can f'(x)*x/f(x) be very large? The typical
       way is for f(x) to be tiny, i.e. x to be close to a root r, where f(r)=0.
       In this case, we can use Newton's method to estimate r: r = x - f(x)/f'(x) .
       Now divide this by x, and rearrange to get
             f'(x)*x/f(x) = 1/((x-r)/x) = 1/(relative distance from x to r)
                          = 1/(relative distance from x to nearest problem
                               whose condition number is infinite)
       So we can think of the ill-conditioned problems geometrically as small
       neighborhoods of the set of "infinitely ill-conditioned problems"
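
        For the cubic (x-2)(x-3)(x-4) at x = 3.01 (nearest root r = 3), this reads:

           x = 3.01;  r = 3;      % x near the root r
           1/abs((x - r)/x)       % = 301, approximately the condition number ~300 at this x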

        How do we interpret all this when x and y are vectors or matrices? 
           Ex: solving A*x=b means f({A,b}) = inv(A)*b, so 
            (1) we need to differentiate this to get a condition number
            (2) we need to show that our algorithms give inv(A+dA)*(b+db), i.e.
                are backward stable
             (3) we need to show that the condition number of A
                 = 1/(relative distance from A to the nearest "infinitely ill-conditioned problem")
                 = 1/(relative distance from A to the nearest singular matrix)
            In this case the singular value decomposition will tell us all this info.
        We will need to develop more tools to do this, such as norms (which generalize
        the notion of "absolute value" to vectors and matrices).
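
        Here is a Matlab sketch of item (3) using the SVD (the nearly singular
        matrix below is just an illustration): the 2-norm relative distance from
        A to the nearest singular matrix is sigma_min/sigma_max = 1/cond(A).

           A = [1 2; 2 4.001];                % a nearly singular example
           [U,S,V] = svd(A);                  % A = U*S*V'
           sigma = diag(S);                   % singular values, largest first
           cond_A = sigma(1)/sigma(end);      % same as cond(A)
           dA = -sigma(end) * U(:,end) * V(:,end)';   % smallest perturbation making A singular
           min(svd(A + dA))                   % tiny (roundoff level): A+dA is singular
           [norm(dA)/norm(A), 1/cond_A]       % these agree: ||dA||/||A|| = 1/cond(A)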

     importance of exploiting mathematical structure
      typical dialogue with someone in my office hours:
      What is the best way to solve n-by-n linear system Ax=b?
        Gauss elim:                                      (2/3)n^3
           if A is symmetric and positive definite, then:
        Cholesky:                                        (1/3)n^3
            if A is banded, with bandwidth sqrt(n), then:
         Band Cholesky:                                      O(n^2)
            if at most 5 nonzeros/row, condition number = O(n), then:
        Conjugate Gradients                                 O(n^(3/2))
           if problem is finding the equilibrium temperature 
           distribution in unit square of uniform material given
           temperature at the boundaries (Poisson equation), then:
        FFT:                                             O(n log n)
           or
        Multigrid:                                       O(n)
        Since the size of the answer is n, this is optimal
         this is one example of a path in a "decision" tree: each node
           asks a question about the structure of the problem, and the
           more you know, the more potential for speed or accuracy you have
       we'll see other decision trees 
       exploiting structure is a major source of research activity
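
       A small Matlab illustration of one branch of that decision tree (sizes and
       tolerances are arbitrary): the same Poisson-type system solved while ignoring
       sparsity, exploiting sparsity, and using conjugate gradients.

          n = 60;  N = n^2;                  % an N-by-N system (illustrative size)
          A = gallery('poisson', n);         % sparse, symmetric positive definite
          b = ones(N, 1);
          tic; x1 = full(A) \ b;            t_dense  = toc;  % dense factorization, O(N^3) work
          tic; x2 = A \ b;                  t_sparse = toc;  % sparse/banded factorization
          tic; x3 = pcg(A, b, 1e-10, 1000); t_cg     = toc;  % only needs matrix-vector products
          [t_dense, t_sparse, t_cg]          % compare the three times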

     usually there are many possible variations on these algorithms, and
        many implementations; how do we choose the best one?
       Criteria:
           (1) fewest keystrokes:  eg "A\b"; we will try to give pointers
                to the best available implementations, usually in libraries;
                lots of pointers on the class webpage (netlib, GAMS) (demo).
                "A\b" invokes the LAPACK library, which is also used as the
                basis of the libraries supplied by most computer vendors,
                and has been developed in a collaboration by Berkeley and
                other universities over a period of years.
          (2) fastest
              traditional approach: count flops
              ASK: how many operations does it take to multiply 2 nxn matrices?    
              Classical: 2*n^3
               Strassen (1969):  O(n^log2 7) ~ O(n^2.81) - sometimes practical
                     (see the sketch below)
               Coppersmith/Winograd (1987):  O(n^2.376)  - not practical
              Umans/Cohn: O(n^2.376), maybe O(n^2)? reduced to group theory 
                    problem (FOCS2003) - not yet practical
              Demmel, Dumitriu, Holtz (2008): all the other problems as fast 
                 as matmul (and backward stable) - ideas behind some of these 
                 algorithms are practical, and will be discussed
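
               Here is a sketch of Strassen's recursion in Matlab (assuming n is a
               power of 2; the crossover size 64 below is an arbitrary choice):

                  function C = strassen(A, B)
                    % multiply square matrices using 7 recursive half-size multiplies
                    n = size(A,1);
                    if n <= 64                % small case: ordinary matmul
                      C = A*B;  return
                    end
                    m = n/2;  i1 = 1:m;  i2 = m+1:n;
                    A11 = A(i1,i1); A12 = A(i1,i2); A21 = A(i2,i1); A22 = A(i2,i2);
                    B11 = B(i1,i1); B12 = B(i1,i2); B21 = B(i2,i1); B22 = B(i2,i2);
                    M1 = strassen(A11+A22, B11+B22);
                    M2 = strassen(A21+A22, B11);
                    M3 = strassen(A11,     B12-B22);
                    M4 = strassen(A22,     B21-B11);
                    M5 = strassen(A11+A12, B22);
                    M6 = strassen(A21-A11, B11+B12);
                    M7 = strassen(A12-A22, B21+B22);
                    C = [M1+M4-M5+M7,  M3+M5;
                         M2+M4,        M1-M2+M3+M6];
                  end

               For example, with A = randn(512) and B = randn(512), the value
               norm(strassen(A,B) - A*B) should be small (roundoff times a modest factor).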

              but counting flops not really right in today's and future world,
              for two reasons:
              (1) ASK: who knows Moore's Law?
                 Until 2007: computers doubled in speed every 18 months with no code 
                 changes. This has ended, instead the only way computers can run faster 
                 is by having multiple processors, so all code that needs to run faster 
                 (not just linear algebra!) has to change to run in parallel.
                  This is causing upheaval in the computer industry; as a result,
                  the ParLab was established at UC Berkeley to do research on this.
                 We will discuss parallel versions of many algorithms.

              (2) ASK: what is most expensive operation in a computer?
                    moving data, whether between DRAM and cache or between
                      processors over a network; it can cost 100x or 1000x more
                      to move a word than to do an add/sub/mul
                   Ex: Consider adding two nxn matrices C=A+B. The cost
                       is n^2 reads of A (moving each A(i,j) from main memory to the 
                       CPU, i.e. the chip that does the addition), n^2 reads of B,
                       n^2 additions, and n^2 writes of C. The reads and writes cost
                       O(100) times as much as the additions
                    Ex: an nVIDIA GPU (2 years ago) attached to a CPU: it cost 4 microseconds
                        just to call a subroutine, time in which you could have done
                        over 1e6 flops, since the GPU runs at 300 GFlops/s;
                    Technology trends are making this worse: arithmetic is getting faster
                        by about 60%/year, but the speed of moving data improves by only
                        about 15%/year
                   Consequence: two different algorithms for the same problem, even if
                     they do the same number of arithmetic operations, may differ vastly 
                     in speed, because they do different numbers of memory moves.
                    Ex: the difference between matmul written in the naive way and an
                        optimized version is easily 40x, and similarly for other
                        operations (see the sketch below)
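
                    A quick Matlab sketch of that last point (the modest size keeps the
                    loop fast enough to run): the naive triple loop and the built-in A*B
                    do the same 2*n^3 flops but differ enormously in speed (in interpreted
                    Matlab the gap is exaggerated further by interpreter overhead, but the
                    same effect shows up in compiled code because of data movement).

                       n = 200;  A = randn(n);  B = randn(n);
                       tic; C1 = A*B;  t_opt = toc;     % optimized (blocked, cache-aware) matmul
                       tic
                       C2 = zeros(n);
                       for i = 1:n
                         for j = 1:n
                           for k = 1:n
                             C2(i,j) = C2(i,j) + A(i,k)*B(k,j);
                           end
                         end
                       end
                       t_naive = toc;
                       [t_naive, t_opt, t_naive/t_opt]  % the ratio is typically very large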

                   We have discovered new algorithms for many of the matrix 
                   factorizations discussed in this class that provably minimize 
                   data movement, and are much faster than the conventional algorithms.
                   We have also gotten large grants to replace the standard libraries 
                   (called LAPACK and ScaLAPACK) used by essentially all companies 
                   (including Matlab!). So you (and many others) will be using
                   these new algorithms in the next few years (whether you know it or 
                   not!).

                   Note that parallelism and data movement are coupled, because moving 
                   data between processors is expensive.
      
         (3) accuracy
               can't always guarantee tiny error at reasonable cost, if ill-conditioned
               try to guarantee tiny backward error
                 "right answer for a slightly different problem"
                  if you only know your input to 12 digits, and the backward error
                    only introduces uncertainty in the 15th digit, then the
                    answer is as good as you can expect
               provide cheap, estimated condition numbers
                 for rough error bounds
                what if you really need a proof that the error is small?
                   it depends what you mean by "proof":
                      high reliability: run a variant of Newton's method in
                         high precision until it converges (see the sketch below)
                      use interval arithmetic
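
                For A*x=b, the "Newton's method" variant referred to above is
                iterative refinement; a minimal Matlab sketch (here the residual is
                computed in working precision, though ideally it would be done in
                higher precision; the matrix is just a random example):

                   n = 100;  A = randn(n);  b = randn(n,1);   % an example system
                   [Lf,Uf,Pf] = lu(A);            % factor A once
                   x = Uf \ (Lf \ (Pf*b));        % initial solve
                   for iter = 1:5
                       r  = b - A*x;              % residual (ideally in high precision)
                       dx = Uf \ (Lf \ (Pf*r));   % correction, reusing the factorization
                       x  = x + dx;
                       if norm(dx) <= eps*norm(x), break, end
                   end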
              

     How to engineer good numerical software?
           Easiest:   use existing software if appropriate (Matlab)
             finding existing software on web (NETLIB, GAMS)  (see class homepage)
          Harder:
             writing your own - is it worth it?
             Show results of tuning matmul from CS267
                    http://www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
             eg: suppose you type "x=A\b" in Matlab to solve Ax=b 
                     (no programming required)
                 and it takes a minute to solve.
                 Suppose that A has special structure so that if you
                 spent 1 week programming, you could solve it in 1 second.
                 Should you spend that week? (trick question)
                 Answer: if you only need to solve it once, no
                         if you or others need to solve 
                            over 7 days/59 secs ~ 1e4 times, yes
          Current research: generating code automatically
              motivation: hard to do, implementations keep changing
                as computers change, see bebop.cs.berkeley.edu

Example: Polynomial Evaluation, and polynomial zero finding

   EX: how bisection works 
       try it on (x-2)(x-3)(x-4) = x^3 - 9*x^2 + 26*x - 24
       rts = [2,3,4]
       coeff = poly(rts)
       help bisect (on web page)
       bisect(coeff,2.6,3.1,1e-12)
       bisect(coeff,2.2,3.2,1e-12) 
           ... no surprises, get small intervals around 3
       
       now try it on (x-2)^13
       rts = 2*ones(1,13)
       coeff = poly(rts)
       bisect(coeff,1.7,2.4,1e-12)
       bisect(coeff,1.7,2.2,1e-12)
        bisect(coeff,1.9,2.2,1e-12) ... huh? a very different answer each time?
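
        For reference, a bisection routine along these lines (the actual bisect.m
        on the web page may differ) could look like:

           function xm = bisect(coeff, xlo, xhi, tol)
             % find a sign change of the polynomial with coefficients coeff in [xlo,xhi]
             flo = polyval(coeff, xlo);
             while xhi - xlo > tol
               xm = (xlo + xhi)/2;
               fm = polyval(coeff, xm);
               if sign(fm) == sign(flo)    % same sign as left end: move left end right
                 xlo = xm;  flo = fm;
               else                        % sign change in left half: move right end left
                 xhi = xm;
               end
             end
             xm = (xlo + xhi)/2;
           end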
   
   Horner's rule applied to (x-2)^13 (Matlab demo; a Horner sketch appears at the end) - what is the real graph?
     x = 1.7:1e-4:2.3;
     y = polyval(coeff,x);
     yacc = (x-2).^13;
     plot(x,y,'k.',x,yacc,'r')
     axis([1.8,2.2,-1e-8,1e-8])
       ... can we explain this?
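
      For reference, polyval evaluates the polynomial by Horner's rule (nested
      multiplication); a minimal version (with the hypothetical name horner) is:

         function y = horner(coeff, x)
           % evaluate coeff(1)*x^(n-1) + ... + coeff(n) by nested multiplication
           y = coeff(1) * ones(size(x));
           for k = 2:length(coeff)
             y = y .* x + coeff(k);
           end
         end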