Notes for Ma221 Lecture 1, Aug 27 2009

Greetings, class URL: www.cs.berkeley.edu/~demmel/ma221
   Prereqs: good linear algebra, 
            programming experience in Matlab/{Fortran,C,C++}
   First Homework on web page
   Enrollment: will get a larger classroom next week if necessary
   Status of books in bookstore
   Fill out survey - what we cover may depend on what you say
      you are interested in
   Office hours TBD
   Reader - recruiting now!

Syllabus:
   Ax=b, least squares, eigenproblems, singular value decomposition,
      and variations thereof
   Standard techniques

     matrix factorizations
        how to solve triangular linear system?
          A = PLU - Gaussian elimination
        how to solve a least squares problem?
          A = QR  = "QR" decomposition, Q orthogonal, R upper triangular
        eigenproblems: what are eigenvalues of a triangular matrix?
          A = Q*T*Q', Q unitary, T triangular - Schur factorization
        Singular value decomposition:
          A = U Sigma V'
        dense and sparse versions, "partial factorizations" for sparse matrices
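
        a quick Matlab sketch of these four factorizations (all four are
        built in; the random test matrix is just for illustration):
          A = randn(5);
          [L,U,P]  = lu(A);              % P*A = L*U  (Gaussian elimination)
          [Q,R]    = qr(A);              % A = Q*R, Q orthogonal, R upper triangular
          [Qs,T]   = schur(A,'complex'); % A = Qs*T*Qs', Qs unitary, T triangular
          [U2,S,V] = svd(A);             % A = U2*S*V', the SVD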

     iterative methods
        those based only on matrix-vector multiplication 
             (Krylov subspace methods, eg conjugate gradient method)
        those that depend more on matrix structure

     importance of exploiting structure
      typical dialogue with someone in my office hours:
      What is the best way to solve n-by-n linear system Ax=b?
        Gauss elim:                                      (2/3)n^3
           if A is symmetric and positive definite, then:
        Cholesky:                                        (1/3)n^3
            if A is banded, with bandwidth sqrt(n), then:
         Band Cholesky:                                      O(n^2)
            if at most 5 nonzeros/row, condition number = O(n), then:
         Conjugate Gradients:                                O(n^(3/2))
           if problem is finding the equilibrium temperature 
           distribution in unit square of uniform material given
           temperature at the boundaries (Poisson equation), then:
        FFT:                                             O(n log n)
           or
        Multigrid:                                       O(n)
        Since the size of the answer is n, this is optimal
        this is one example of a path through a "decision tree": each node
          asks a question about the structure of the problem, and the
          more you know, the more potential for speed or accuracy you have
        we'll see other decision trees later
        exploiting structure is a major source of research activity
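
        as a small illustration (a sketch, not from the lecture), Matlab's
        backslash already exploits sparsity if you hand it a sparse matrix;
        gallery('poisson',k) below is the built-in k^2-by-k^2 sparse matrix
        of the 2D Poisson equation (timings will vary by machine):
          k = 50; n = k^2;
          A = gallery('poisson',k);           % sparse, about 5 nonzeros/row
          b = ones(n,1);
          tic; x1 = full(A)\b; t_dense  = toc % dense Gaussian elimination
          tic; x2 = A\b;       t_sparse = toc % exploits sparsity/band structure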
           
     understanding sources of error - example from Civil Engineering
        Consider question of computing displacement of steel structure 
        ("will this building fall down in an earthquake?")
          (1) write down governing equations (eg PDE) (CE231)
                error sources - continuum mechanics averages over crystal
                   structure; don't know material properties to many digits
          (2) discretize PDE, i.e. replace by linear system,
                using, say, FEM (not topic of this course)
              - error from discretization (ME280A)
          (3) compute and store coefficients of matrix
              - error from computation and storage (eg what happens to 1/3?
                see aside below)
          (4) run iterative algorithm
              - error from stopping iteration after finite # steps
          (5) each arithmetic operation rounded off
              - error from roundoff
        last 3 error sources are topics of this course, but in the overall
        problem the error may well be dominated by the first 2 (this is a goal!)
           understanding the errors in those first 2 steps is a course
           (or more) by itself...
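
        (aside on "what happens to 1/3": it has no finite binary expansion,
         so the stored double is rounded; a one-line Matlab check:
           fprintf('%.20f\n', 1/3)   % prints 0.33333333333333331483...
         i.e. the stored value agrees with 1/3 only to about 16 digits)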

     perturbation theory and condition numbers
        way to measure contribution to error from input
          including uncertainty from problem formulation
       suppose problem is to evaluate scalar function y = f(x)
          but we only know x + dx, and that dx is "small compared to x"
             absolute error = |f(x+dx) - f(x)| ~ |f'(x)| * |dx|
             relative error = |f(x+dx) - f(x)| / |f(x)| ~ |f'(x) x / f(x)| * |dx/x|
               = condition number * relative change in input
       ASK: why relative error instead of absolute error? avoid units
       Next suppose that we have an algorithm alg(x) ~ f(x)
         which makes approximations, commits roundoff errors, and
           only takes a finite number of iterations; what do we do?
       Def: alg(x) is "backward stable" if alg(x) = f(x+dx) with dx/x small,
            i.e. it gets the right answer for a slightly wrong problem
       then alg(x) - f(x) = f(x+dx) - f(x), and the above analysis applies
       how do we interpret this when x, y are vectors or matrices?
          solving A*x=b means f({A,b}) = inv(A)*b
          need to show that our algorithms give inv(A+dA)*(b+db), dA and db small
       we will need to develop more tools for this later.
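
       a tiny Matlab check of the estimate above (example chosen here, not
       from the notes): f(x) = log(x) has condition number |f'(x)*x/f(x)|
       = |1/log(x)|, which is huge for x near 1:
         x = 1 + 1e-8;  dx = 1e-14*x;                    % relative change 1e-14
         relerr    = abs(log(x+dx)-log(x))/abs(log(x))   % ~1e-6
         predicted = abs(1/log(x)) * abs(dx/x)           % also ~1e-6, as claimed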

     usually there are many possible algorithms to choose from; how do we
     pick one, and how do we implement it?
       (need to do this even after you have decided, say, that Gauss Elim, 
       or CG is "best")
       Criteria:
          (1) fewest keystrokes:  eg "A\b", we will try to give pointers
               to the best available implementations, usually in libraries
               lots of pointers on class webpage (netlib, GAMS) (demo)
          (2) fastest
              traditional approach: count flops
              ASK: how many operations does it take to multiply 2 nxn matrices?    
              Classical: 2*n^3
              Strassen (1969):  O(n^(log2 7)) ~ O(n^2.81) - sometimes practical
              Coppersmith/Winograd (1987):  O(n^2.376)  - not practical
              Cohn/Umans: O(n^2.376), maybe O(n^2)? reduced to a group theory
                    problem (FOCS 2003) - not yet practical
              Demmel, Dumitriu, Holtz (2008): all the other problems (linear
                 systems, least squares, eigenproblems) as fast as matmul
                 (and backward stable) - ideas behind some of these
                 algorithms are practical, and will be discussed
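
              for reference, the classical 2*n^3 count comes straight from
              the triple loop (a minimal Matlab version, put in matmul.m;
              written here for counting flops, not for speed):
                function C = matmul(A,B)
                % naive matrix multiply: n^3 mults + n^3 adds = 2*n^3 flops
                n = size(A,1);
                C = zeros(n);
                for i = 1:n
                  for j = 1:n
                    for k = 1:n
                      C(i,j) = C(i,j) + A(i,k)*B(k,j);  % 2 flops per trip
                    end
                  end
                end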

              but counting flops is not really right in today's (and the
              future) world, for two reasons:
              ASK: what is the most expensive operation in a computer?
                   moving data, between DRAM and cache or between processors
                     over a network - it can cost 100x or 1000x more to move
                     a word than to add/sub/multiply
                   Ex: nVIDIA GPU attached to a CPU: it takes 4 microseconds
                       to call a subroutine, in which time the GPU could have
                       done ~1e6 flops, since it runs at 300 GFlops/s
                   Consequence: two different implementations of the same
                     algorithm (matmul, GE) that do the same #flops may differ
                     vastly in speed, because they do different numbers of
                     memory moves.
                   Ex: the difference between matmul written the naive way and
                       an optimized version is easily 10x; same for other ops
                       (timing sketch below)
                   We will discuss a general theory and recent results for 
                     how to minimize the data movement in algorithms
                     (recent work, lots of projects)
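
                   a sketch you can run on any machine (assumes matmul.m from
                   the sketch above is on the path; both lines do the same
                   2*n^3 flops, but with very different memory access patterns):
                     n = 200; A = randn(n); B = randn(n);
                     tic; C1 = A*B;         t_blas  = toc  % tuned BLAS library
                     tic; C2 = matmul(A,B); t_naive = toc  % triple loop above
                     t_naive/t_blas        % typically 10x or (much) more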
      
              second related reason:
              ASK: who knows Moore's Law?
                 Until 2007: computers doubled in speed every 18 months with
                 no code changes. This has ended; instead, processor core
                 counts are doubling every 2 years, so any code that needs
                 to run faster has to change to run in parallel.
                 This is causing upheaval in the computer industry; as a result
                 the ParLab was established at UC Berkeley to do research on this.
                 We will discuss parallel versions of many algorithms
                 Note that parallelism and data movement are coupled, because moving data 
                     between processors is expensive.

         (3) accuracy
               can't always guarantee tiny error at reasonable cost
               try to guarantee tiny backward error
                 "right answer for a slightly different problem"
               provide cheap, estimated condition numbers
                 for rough error bounds
               what if you really need a proof that the error is small: 
                  depends what you mean by "proof":
                      high reliability: run a variant of Newton's method in
                         high precision until it converges
                     use interval arithmetic
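
                a Matlab sketch of the backward error and cheap condition
                estimate just mentioned (norm and condest are built in; the
                random A and b are just for illustration):
                  A = randn(100); b = randn(100,1);
                  x = A\b;
                  backward_err = norm(A*x-b)/(norm(A)*norm(x)+norm(b))
                  err_bound = condest(A)*backward_err  % rough relative error bound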
              

     How to engineer good numerical software?
           Easiest:   use existing software if appropriate (Matlab)
             finding existing software on web (NETLIB, GAMS)  (see class homepage)
          Harder:
             writing your own - is it worth it?
             Show results of tuning matmul from CS267
                    http://www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
             eg: suppose you type "x=A\b" in Matlab to solve Ax=b 
                     (no programming required)
                 and it takes a minute to solve.
                 Suppose that A has special structure so that if you
                 spent 1 week programming, you could solve it in 1 second.
                 Should you spend that week? (trick question)
                 Answer: if you only need to solve it once, no
                         if you or others need to solve 
                            over 7 days/59 secs ~ 1e4 times, yes
          Current research: generating code automatically
              motivation: hard to do, implementations keep changing
                as computers change, see bebop.cs.berkeley.edu

Example: Polynomial Evaluation, and polynomial zero finding

   EX: how bisection works 
       try it on (x-2)(x-3)(x-4) = x^3 - 9*x^2 + 26*x - 24
       rts = [2,3,4]
       coeff = poly(rts)
       help bisect (on web page)
       bisect(coeff,2.6,3.1,1e-12)
       bisect(coeff,2.2,3.2,1e-12) 
           ... no surprises, get small intervals around 3
       
       now try it on (x-2)^13
       rts = 2*ones(1,13)
       coeff = poly(rts)
       bisect(coeff,1.7,2.4,1e-12)
       bisect(coeff,1.7,2.2,1e-12)
       bisect(coeff,1.9,2.2,1e-12) ... huh?
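
        bisect.m is posted on the class web page; a minimal sketch of what
        such a routine does (assuming polyval(coeff,.) changes sign on
        [xlo,xhi]; this is a sketch, not the posted code):
          function xmid = bisect(coeff,xlo,xhi,tol)
          % keep halving [xlo,xhi], retaining the half with the sign change
          flo = polyval(coeff,xlo);
          while xhi - xlo > tol
            xmid = (xlo + xhi)/2;
            fmid = polyval(coeff,xmid);
            if sign(fmid) == sign(flo)
              xlo = xmid; flo = fmid;   % zero lies in the right half
            else
              xhi = xmid;               % zero lies in the left half
            end
          end
          xmid = (xlo + xhi)/2;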
   
   Apply Horner's rule (polyval) to (x-2)^13 (Matlab demo) - what is the real graph?
     x = 1.7:1e-4:2.3;
     y = polyval(coeff,x);
     yacc = (x-2).^13;
     plot(x,y,'k.',x,yacc,'r')
     axis([1.8,2.2,-1e-8,1e-8])
       ... can we explain this
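
    polyval evaluates by Horner's rule; for reference, a minimal version of
    the recurrence (coefficients in Matlab's highest-degree-first order):
      function y = horner(coeff,x)
      % Horner: p(x) = (...((c_1*x + c_2)*x + c_3)*x + ...)*x + c_{n+1}
      y = coeff(1)*ones(size(x));
      for i = 2:length(coeff)
        y = y.*x + coeff(i);    % one multiply and one add per coefficient
      end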