Notes for Randomized Numerical Linear Algebra, updated 03 Mar 2015

Starting today, we will record each lecture on video, broadcast it live, and archive it on YouTube. The link today will be https://www.youtube.com/watch?v=q6mcfCCXjUo We will post instructions for future lectures on the website.

In addition to material on the class website https://who.rocq.inria.fr/Laura.Grigori/RandNLA.html we borrow from material presented in Ma221 in Fall 2014, at http://www.cs.berkeley.edu/~demmel/ma221_Fall14/Lectures/Lecture_19.txt Papers on recent communication-avoiding algorithms like TSQR may be found at bebop.cs.berkeley.edu

Our goals in this introductory lecture are
(1) to discuss the metrics for performance, accuracy and reliability that we want to use to compare randomized and nonrandomized algorithms (both direct and iterative), which we hope will inform discussions later in the semester too;
(2) to recall different kinds of Least Squares (LS) problems, and for the simplest case of full-rank, overdetermined problems, the usual nonrandomized algorithms (direct: NE = Normal Equations, QR and SVD, and iterative: LSQR) and what their metrics are;
(3) to discuss randomized LS solvers, including Blendenpik, Clarkson-Woodruff, ...

Metrics:

(1) #Flops: This may depend just on the matrix dimensions, or also on the sparsity structure. For dense, square n x n matrices, direct linear algebra algorithms typically cost O(n^3). Given a fast matmul algorithm costing O(n^w), there are "numerically stable" O(n^(w+eps)) algorithms, for any eps>0; see "Fast Linear Algebra is Stable" by Demmel/Dumitriu/Holtz for details. In at least one case, the nonsymmetric eigenproblem, the only algorithm attaining this bound uses randomization. For m x n matrices with m>n, for example the overdetermined LS problem, the corresponding complexities are O(m*n^2) and O(m*n^(w-1+eps)). We will mostly consider O(n^3) or O(m*n^2) algorithms, which are often (but not always) more practical.

For sparse matrices, with nnz(A) = number of nonzeros in A << m*n, we seek both direct and iterative algorithms that do much less arithmetic. Suppose A has n(i) nonzeros in row i, so nnz(A) = sum_i n(i). Looking ahead, we will have nonrandomized direct algorithms for LS with m >> n that cost O(sum_i n(i)^2) flops (plus "lower order terms"), which will be a lower bound under certain assumptions, in contrast to a randomized algorithm that costs O(sum_i n(i)) = O(nnz(A)) flops (Clarkson-Woodruff), or O(m*n) in the dense case.

Iterative methods like LSQR are also used; their inner loop is typically a matrix-vector multiplication, which costs 2*nnz(A) flops. The number of iterations depends on numerical properties of the matrix, like the distribution of its singular values, often summarized by the condition number kappa(A) = sigma_max(A)/sigma_min(A). Looking ahead, we will describe a nonrandomized iterative method that does O(kappa(A)) iterations, and a randomized one that does O(1) iterations w.h.p. (Blendenpik), and so again costs O(m*n) in the dense case.
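As a concrete baseline for the deterministic solvers and flop counts above, here is a minimal Python/numpy sketch (dimensions, data and the use of dense routines are purely illustrative) of solving one full-rank overdetermined LS problem by NE, QR, SVD and LSQR:

    import numpy as np
    from scipy.sparse.linalg import lsqr

    rng = np.random.default_rng(0)
    m, n = 2000, 50                       # illustrative sizes, m >> n
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    # Normal equations: ~ m*n^2 flops, but squares the condition number (see metric (3) below).
    x_ne = np.linalg.solve(A.T @ A, A.T @ b)

    # Householder QR: backward stable, ~ 2*m*n^2 flops.
    Q, R = np.linalg.qr(A)                # economy size: Q is m x n, R is n x n
    x_qr = np.linalg.solve(R, Q.T @ b)

    # SVD: backward stable, more expensive, and also handles rank deficiency.
    x_svd = np.linalg.lstsq(A, b, rcond=None)[0]

    # LSQR: iterative, ~ 2*nnz(A) flops per iteration; iterations grow with kappa(A).
    x_lsqr = lsqr(A, b)[0]

    print(np.linalg.cond(A), np.linalg.norm(x_qr - x_svd), np.linalg.norm(x_qr - x_lsqr))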
(2) Communication cost: Communication means moving data, either between levels of a memory hierarchy or between parallel processors over a network, and costs orders of magnitude more per operation than a flop. #Flops is a good metric for small problems that fit in cache, but the motivation for randomized algorithms is very large problems where this is not the case.

So our performance model for sending one "message" of n contiguous words will be alpha + n*beta = latency + n/bandwidth, where a "message" could be a cache line in a shared memory system, an actual message in a distributed memory system (like Hopper and Edison), or a disk access. Our overall performance model (along the critical path of an algorithm) is then
   gamma*#Flops + beta*#Words_sent + alpha*#Messages = gamma*F + beta*W + alpha*S
where gamma << beta << alpha, with the gaps growing over time following technological trends. We can use the same model for energy, where gamma etc. have units of joules instead of seconds, since energy usage is becoming at least as important a metric as run-time (and they are obviously correlated).

There has been a lot of recent work on communication lower bounds, starting with linear algebra, but extending to code that can be described abstractly as "for all k-tuples of integers in some subset of Z^k, execute some code that accesses arrays whose subscripts are affine functions of those integers." This class of algorithms clearly includes matrix multiplication C = A*B, with 3-tuples of integers (i,j,k) accessing arrays A(i,k), B(k,j) and C(i,j), but also most other linear algebra algorithms (LU, QR, ...), for dense or sparse matrices, and either sequential or parallel implementations. And it goes beyond classical linear algebra, including fast linear algebra (eg Strassen's method), some graph algorithms, direct n-body algorithms, tensor contractions, etc. For this class, we will just state the lower bound for (most) classical linear algebra, and use it as a goal against which to measure our algorithms. Since randomized algorithms often use classical linear algebra as building blocks, it will apply at least to components of them.

To state the result, consider a 2-level memory hierarchy, called fast and slow, where fast could equal cache and slow = DRAM, or fast = DRAM and slow = disk, or fast = local memory of one parallel processor and slow = memory of other parallel processors connected by a network. We assume that arithmetic can only be done on data in the fast memory, which holds M words. We want to lower bound
   W = number of words moved between fast and slow memory = Omega( max( #iterations / M^{1/2}, #inputs, #outputs ) )
where the number of iterations refers to the inner loop of the algorithm, say a multiply-add for matrix multiply or classical linear algebra more generally. In the case of an overdetermined least squares problem with a dense m x n input matrix, running sequentially, #iterations = Theta(m*n^2), so this bound becomes
   W = Omega( max( m*n^2 / M^{1/2}, m*n ) ) = Omega( m*n ) when n <= M^{1/2}
In other words, the lower bound is the size of the input, raising the question of whether we can really just access the data once, even when it is arbitrarily large, provided n <= M^{1/2}, i.e. n^2 <= M, i.e. we can fit a square n x n submatrix in fast memory.

Def: A "streaming algorithm" can solve an arbitrarily large problem that is read once into fast memory (say a row at a time). Note that a streaming algorithm does not even require the data to exist entirely at one time, eg sensors could be producing data in real time that is streamed into a computer and then thrown away. If we instead assume that the data is in fact stored in slow memory, and so can be accessed more than once, we can ask for a slightly weaker goal:
Def: A "weakly streaming algorithm" can solve a problem by reading it from slow to fast memory O(1) times.

TSQR is a recent deterministic streaming algorithm. It targets dense matrices, but could be modified to take (some) advantage of sparsity. We will later describe randomized (weakly) streaming LS algorithms.
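To make the streaming idea concrete, here is a minimal sequential sketch (not the parallel/tree version of TSQR, and with a made-up block size and random data standing in for rows arriving from slow memory or a sensor) that computes the LS solution in one pass over [A, b], keeping only a small triangular factor in fast memory:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n, blk = 100000, 50, 5000          # illustrative sizes; blk rows fit in fast memory
    R = np.zeros((0, n + 1))              # triangular factor of the part of [A, b] seen so far

    for start in range(0, m, blk):
        nrows = min(blk, m - start)
        Ablk = rng.standard_normal((nrows, n))  # stand-in for reading the next block of rows of A
        bblk = rng.standard_normal(nrows)       # ... and the corresponding part of b
        stacked = np.vstack([R, np.column_stack([Ablk, bblk])])
        R = np.linalg.qr(stacked, mode='r')     # new (n+1) x (n+1) R; the block is then discarded

    x = np.linalg.solve(R[:n, :n], R[:n, n])    # solve R*x = (Q^T b) from the triangular factor
    resid_norm = abs(R[n, n])                   # = || A*x - b || for the data that was streamed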
Regarding the latency cost, given a lower bound on W, a lower bound on the number of messages is simply given by S >= W/largest_message_size. The largest message size may be determined by characteristics of the HW or SW, but is obviously bounded by M, the fast memory size to which the message is delivered. Thus
   S = Omega( max( #iterations / M^{3/2}, #inputs/M, #outputs/M ) )
Lower bounds on both W and S can be attained by deterministic algorithms; whether randomized ones can attain them too is TBD.

In the parallel case, with p processors, where fast memory is the local memory of one processor and slow memory is the memory of the other processors, we may consider a load balanced algorithm, so the lower bound becomes
   W = number of words moved between one processor and the other processors = Omega( (#iterations/p) / M^{1/2} ).
The smallest we can choose M, for the dense m x n case, is M = m*n/p, i.e. there is just enough memory to store the problem, leading to the lower bound
   W = Omega( (m*n^2/p) / (m*n/p)^{1/2} ) = Omega( m^{1/2} * n^{3/2} / p^{1/2} )
which is attainable by recent deterministic algorithms. The latency lower bound is again a factor of M smaller:
   S = Omega( (n/m)^{1/2} * p^{1/2} )
which is also attainable when m=n. But the lower bound on W remains true if M exceeds its lower bound m*n/p. For example, if we have enough memory for c>1 copies of the data, so M = c*m*n/p, then the lower bound on W decreases by a factor of c^{1/2}, and the lower bound on S decreases by a factor of c^{3/2}. This is attainable by matmul, at least when m=n, up to c = p^{1/3}, and for W by QR and LU. On the other hand, we do not expect S to decrease in all cases, because (at least for LU, and we conjecture for QR and SVD, but not NE) there is an independent lower bound W*S = Omega(n^2), so if W decreases, S must increase. We note that for "Strassen-like algorithms" (which require some technical assumptions), the M^{1/2} in the above lower bounds is replaced by M^{w/2 - 1}, where w is the exponent of matrix multiplication, and this is attainable for various nonrandomized algorithms. Finally, for iterative methods, the communication cost is typically (number of iterations) * (cost of reading the matrix from slow memory), assuming the matrix is too big to fit in fast memory.

To summarize goals (1) (flops) and (2) (communication):
(1) Direct (nonrandomized) methods cost O(m*n^2) or O(sum_i n(i)^2) flops, where n(i) is the number of nonzeros in A(i,:). Iterative (nonrandomized) methods cost O(nnz(A)) flops per iteration, and may take O(kappa(A)) iterations. Randomized goal: O(nnz(A)) flops, i.e. a constant number of flops per nonzero.
(2) Direct: when n^2 <~ M = fast memory size, we can access the data just once. Iterative: when A does not fit in fast memory, we access the data O(kappa(A)) times. Randomized goal: access the data once, or O(1) times.
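To get a feel for the sizes involved, here is a small sketch that just plugs illustrative numbers (all made up) into the sequential and parallel lower bounds above for a dense m x n LS problem:

    import math

    m, n, p = 10**6, 10**3, 10**2          # made-up problem and machine sizes
    M_seq = 10**8                          # words of fast memory, sequential case

    W_seq = max(m * n**2 / math.sqrt(M_seq), m * n)   # Omega( max(#iterations/M^{1/2}, #inputs) )
    S_seq = W_seq / M_seq                             # messages >= words moved / max message size

    M_par = m * n / p                      # just enough memory per processor to store A
    W_par = (m * n**2 / p) / math.sqrt(M_par)         # = m^{1/2} * n^{3/2} / p^{1/2}
    S_par = W_par / M_par                  # latency bound is a factor of M smaller

    c = 4                                  # c copies of the data lower W by a factor of c^{1/2}
    print(W_seq, S_seq, W_par, S_par, W_par / math.sqrt(c))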
(3) Accuracy: For LS, we can ask either about the error in the solution x or the residual r = A*x-b. The usual nonrandomized goal is called backward stability, i.e. getting the exact solution for a slightly different input; since the input usually has uncertainty, we ask for an answer "as accurate as the data deserves". So this metric of success is that the computed solution xhat satisfies
   (1) xhat = argmin || (A+E)*xhat - (b+f) ||
where E and f are "small" compared to A and b, resp., say measured by eps = norm(E)/norm(A) + norm(f)/norm(b) << 1.

We may use eps to bound the difference between xhat and the true solution x; to first order in eps:
   (2) || x - xhat || / || x || <= eps*( kappa(A)/cos(theta) + kappa^2(A)*tan(theta) )
where kappa(A) is the condition number and sin(theta) = norm(r)/norm(b), i.e. theta is the angle between b and its best approximation A*x. So this error bound can be very large in two situations: when A is ill-conditioned, i.e. kappa(A) is large, and when theta is close to pi/2, i.e. b is nearly orthogonal to the columns of A. The residuals r = A*x-b and rhat = (A+E)*xhat - (b+f) can differ by
   (3) || r - rhat || / || b || <= (1 + 2*kappa(A)) * eps
so the residual is less sensitive.

Deterministic algorithms based on QR and SVD are backward stable and so attain these bounds with eps = O(machine epsilon) = O(10^(-16)) in double precision. NE is not backward stable, and only guarantees
   (4) || x - xhat || / || x || <= O(machine epsilon) * kappa^2(A)
even when tan(theta) is small. Some randomized algorithms aim for comparable accuracy: for example, Blendenpik (see below) is "about as stable as QR" in the words of the authors. Finally, in some applications we might only want the weaker criterion that
   (5) || A*xhat - b || <= (1 + eps) * || A*x - b ||
where xhat is the computed solution and A*xhat-b its residual, and A*x-b is the true residual, i.e. the fit is about as good in norm. Clarkson-Woodruff (see below) make such a guarantee for their algorithm:

Thm 1 (CW): With probability at least 2/3, norm(A*xhat-b,2) is at most (1+eps) times the true minimum norm(A*x-b,2), with cost O(nnz(A) + "lower order terms"), where the "lower order terms" grow like O(n^3*log^6(n/eps)/eps^2).

This theorem illustrates a more general feature of randomized algorithms: they depend on a parameter eps, the desired accuracy, and the cost (eg the number of random samples) grows as eps decreases. Furthermore, they have a probability of failure delta (delta = 1/3 above), and the cost may also grow as delta decreases.

To summarize, we have presented 5 metrics of accuracy, in order from strongest to weakest ((1) implies (2) and (3), but (2) and (3) could potentially hold without (1)). The cost of a randomized algorithm will generally depend on the desired accuracy eps, and possibly on the desired probability of failure. (It is also possible to use iterative refinement and extra precision to get even higher accuracy than (2) and (3), but we will not pursue this here.)

Various LS problems: We will only discuss the first one; randomized algorithms for many of the others remain to be explored, afaik:
   Full rank, overdetermined: argmin || A*x - b ||
   "Damped" overdetermined: argmin || [A; lambda*I]*x - [b; 0] ||, aka ridge regression, or Tikhonov regularization with lambda*B instead
   Low rank, overdetermined: choose x minimizing both || A*x - b || and || x ||
   Weighted: argmin || W*(A*x - b) ||
   Constrained: choose x minimizing || A*x - b || such that B*x = c, or || x || <= tau, or || B*x - c || <= tau
   Underdetermined versions of these

Randomized LS solvers: Citations below refer to the following papers:
   RT = Rokhlin, Tygert
   MSM = Meng, Saunders, Mahoney
   DMMS = Drineas, Mahoney, Muthukrishnan, Sarlos
   DMM = Drineas, Mahoney, Muthukrishnan
   AMT = Avron, Maymounkov, Toledo
   CW = Clarkson, Woodruff

There are two general approaches:
(R1) Let S be an s x m random matrix, let A' = S*A and b' = S*b. Then solve the s x n LS problem argmin || A'*x' - b' ||. When s << m, and S*A is not too expensive, this can be much cheaper than the original problem; a minimal sketch follows.
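Here is that sketch of approach (R1) with a dense Gaussian S, purely for illustration (the methods below use much cheaper S, and the value of s is a guess, not one of the bounds from the theorems):

    import numpy as np

    rng = np.random.default_rng(2)
    m, n, s = 20000, 50, 500
    A = rng.standard_normal((m, n))
    b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

    S = rng.standard_normal((s, m)) / np.sqrt(s)    # s x m random sketching matrix
    Ap, bp = S @ A, S @ b                           # A' = S*A, b' = S*b: an s x n problem
    x_sk = np.linalg.lstsq(Ap, bp, rcond=None)[0]   # solve the small sketched LS problem

    x = np.linalg.lstsq(A, b, rcond=None)[0]        # true solution, for comparison
    print(np.linalg.norm(A @ x_sk - b) / np.linalg.norm(A @ x - b))   # typically close to 1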
This approach includes
   Random row sampling as presented in DMMS ((1) below)
   Leverage score sampling as presented in DMM ((2) below)
   CW ((7) below)

(R2) Again form A' = S*A, and factor A' = Q*R; more generally we only require that the factor Q be well-conditioned and that R be square and easy to solve with, i.e. solving R*x = y for x is cheap (eg R is triangular). Then solve the LS problem iteratively with R as a preconditioner; more precisely, solve argmin || A*inv(R)*y - b || iteratively for y, and then let x = inv(R)*y. By construction A*inv(R) will be well-conditioned w.h.p. and so take a bounded number of iterations to converge. These iterative methods are typically mathematically equivalent to a method like CG for square linear systems applied to the normal equations (A^T*A)*x = A^T*b; A^T*A is not formed explicitly, rather we just need to multiply vectors (A^T*A)*y = A^T*(A*y) using two matrix-vector multiplications. This approach includes
   RT ((4) below)
   LSRN as presented in MSM ((5) below)
   Blendenpik of AMT ((6) below)

Both approaches depend on the Johnson-Lindenstrauss Lemma (JL) and its successors, which say that projecting n m-vectors into a suitably randomly chosen s-dimensional subspace, i.e. multiplying them by an s x m matrix with orthonormal rows (call it S'), will reasonably accurately preserve the distances between all pairs of vectors. So solving the projected least squares problem argmin || S'*(A*x - b) || should give about the same answer. S and S' are not the same; the original S' of JL, a "uniformly distributed" random orthonormal matrix, i.e. with Haar measure, may be computed by doing the QR factorization of an m x s matrix of i.i.d. N(0,1) random variables, and letting S' = Q^T. But computing S' this way is at least as expensive as solving the original LS problem directly (and more if s > n, as it must be). So later approaches to (R1) and (R2) try to use much cheaper S matrices.

We now illustrate some of these approaches, from simple to more sophisticated (and without any proofs), and indicate both the papers in which they appear and whether the algorithms use approach R1 or R2:

(1) (DMMS / R1) Choose s rows of A and b randomly with equal probability, and call the resulting matrix A'. This is equivalent to letting S be an s x m matrix where each row is zero except for a 1 in a random column, and letting A' = S*A. Similarly, b' = S*b. Then if A is "random" and s = O(n log(m) log(n log(m))), then || A*xhat - b || <= (1 + eps) || A*x - b || with "high probability". Of course we can't assume A is random in general.

(2) (DMM / R1) Suppose some column of A has just one nonzero entry, say A(i,j). Then unless we are lucky enough to sample row i, A' will have a zero column, and so be singular, making x' ill-defined. So we clearly need to assign a higher sampling probability to row i. Since A(i,j) is the only nonzero in column j, then in A = Q*R, Q(i,j+1:end) = 0, since later columns of Q are all orthogonal to column j of A. Moreover, e_i = A(:,j)/A(i,j) lies in the column space of Q, so, recalling that Q is m x n with orthonormal columns, || Q(i,:) || = 1. This motivates the definition

Def: the coherence of A is mu(A) = max_i || Q(i,:) ||^2.

The coherence satisfies n/m <= mu(A) <= 1, the upper bound because the rows of Q have norm at most 1 (Q has orthonormal columns), the lower bound because sum_i || Q(i,:) ||^2 = n and there are m rows. If mu(A) is near 1, i.e. the matrix is coherent, then some rows are more "important" than others, whereas if mu(A) is near n/m, i.e. A is incoherent, then rows are more nearly equally important. || Q(i,:) ||^2, also called the "leverage score" of row i, measures how important row i is to sample. Let p_i = || Q(i,:) ||^2 / n be the sampling probability for row i; note that sum_{i=1 to m} p_i = 1. (A sketch of sampling by leverage scores follows.)
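A minimal sketch of leverage-score sampling (dimensions are illustrative; note that computing Q by a dense QR already costs O(m*n^2), which is the point made after Thm 2 below; the rescaling of the sampled rows by 1/sqrt(s*p_i) is the standard one from the papers, though it is omitted in the simplified statements here):

    import numpy as np

    rng = np.random.default_rng(3)
    m, n, s = 20000, 50, 2000
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    Q, _ = np.linalg.qr(A)                    # economy-size Q, m x n with orthonormal columns
    lev = np.sum(Q**2, axis=1)                # leverage scores ||Q(i,:)||^2; they sum to n
    p = lev / n                               # sampling probabilities p_i, summing to 1

    rows = rng.choice(m, size=s, replace=True, p=p)
    w = 1.0 / np.sqrt(s * p[rows])            # standard rescaling of the sampled rows
    Ap, bp = w[:, None] * A[rows], w * b[rows]
    x_sk = np.linalg.lstsq(Ap, bp, rcond=None)[0]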
Thm 2: Let A' be obtained by sampling (with replacement) s rows of A, where row i is chosen with probability p_i = ||Q(i,:)||^2 / n. This is equivalent to letting S be an s x m matrix where each row is zero except for a 1 in column i, chosen with probability p_i, and letting A' = S*A. Solve argmin || A'*x' - b' ||. Then (under some additional conditions on b), we get || A*x' - b || <= (1+eps) * || A*x - b || with probability 1-delta, provided we sample at least this many rows:
   s = Omega( n^2*ln(3/delta) / eps^4 )

Note that the small LS problem, argmin || A'*x' - b' ||, would cost O(s*n^2) = O(n^4 * ln(3/delta) / eps^4) if done using a dense deterministic algorithm, which is potentially much cheaper than a deterministic O(m*n^2) algorithm on the original problem, if m >> n. On the other hand, computing the p_i in the first place would cost as much as a QR decomposition, i.e. O(m*n^2).

(3) (AMT / R2) There is an analogous result to the above theorem, which says that if we use uniform random sampling to get A' = S*A as in (1), then for inv(R') from A' = Q'*R' to be a good preconditioner for LSQR, the number of sampled rows s has to grow at least as fast as m*mu(A), so O(m) for a coherent matrix down to O(n) for an incoherent matrix, i.e. uniform random sampling of a coherent matrix does not work well, as before:

Thm 3 (AMT): Let A' be s randomly sampled rows of A, with uniform probability, and A' = Q'*R'. Let tau = C * sqrt(m * mu(A) * log(s) / s ), and suppose tau/delta < 1. Then with probability at least 1 - delta,
   kappa(A*inv(R')) <= (1 + tau/delta)/(1 - tau/delta)
We will use this to analyze Blendenpik below.

(4) (RT / R1 and R2) In the above examples S was as cheap as possible to multiply by, since forming A' = S*A was equivalent to choosing randomly selected rows of A. Alternatively, we can pick a more expensive (denser) S to multiply by, where multiple nonzeros in each row of S "mix" different rows of A, and so avoid the coherence problem. There are many different examples in the literature, and we mention several of them, starting with the first (RT). Here S = G*H, where G is s x m, G = S'*F*D, and
   S' is s x m and samples s rows uniformly at random, as in the first example,
   F is an m x m DFT, so cheap to multiply by,
   D is an m x m diagonal matrix where D(i,i) is i.i.d. uniform on the unit circle in the complex plane,
and H is m x m, the product of m x m matrices H = Theta * Pi * Z * Theta' * Pi' * Z', where
   Theta and Theta' are each the product of m-1 i.i.d. random Givens rotations [cos(t(i)), sin(t(i)); -sin(t(i)), cos(t(i))] on the coordinates (1,2), (2,3), ..., (m-1,m),
   Pi and Pi' are independent random permutations, and
   Z and Z' are like D above.
The point is that multiplying a column of A by S is relatively cheap, O(m log s), because the DFT (of which only s outputs are needed) is the dominant cost, and that the patterns of the various factors of G and H "mix" the rows of A and avoid coherence problems.

Thm 4 (RT/R1): Choose alpha > 1 and beta > 1, and the number of rows of S equal to
   s >= (alpha^2+1)/(alpha^2-1) * beta * (n+1)^2
Then the solution x' = argmin || A'*x' - b' ||, where A' = S*A and b' = S*b, satisfies || A*x' - b || <= alpha * || A*x - b || with probability 1 - 1/beta.

Note that A' has O(n^2) rows, so solving for x' would cost O(n^4) using a direct method, for a total cost of O(m*n*log(n) + n^4).
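To make the R2 approach concrete before discussing RT's preconditioned variant, LSRN and Blendenpik below, here is a minimal sketch that mixes the rows with a random sign flip followed by an orthogonal transform (a DCT here, rather than RT's G*H or a DFT, to keep everything real), samples s rows uniformly, and uses the R factor of the sample as a right preconditioner for LSQR; the value of s and the explicit formation of A*inv(R) are for illustration only:

    import numpy as np
    from scipy.fft import dct
    from scipy.linalg import solve_triangular
    from scipy.sparse.linalg import lsqr

    rng = np.random.default_rng(4)
    m, n, s = 20000, 50, 400
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    d = rng.choice([-1.0, 1.0], size=m)                 # random diagonal D with entries +-1
    Amix = dct(d[:, None] * A, axis=0, norm='ortho')    # orthogonal transform "mixes" the rows
    sample = rng.choice(m, size=s, replace=False)       # uniform row sampling of the mixed matrix
    R = np.linalg.qr(Amix[sample], mode='r')            # A' = Q'*R'; only R' is kept

    ARinv = solve_triangular(R, A.T, trans='T').T       # A*inv(R'), well-conditioned w.h.p.
    y = lsqr(ARinv, b)[0]                               # argmin || A*inv(R')*y - b || via LSQR
    x = solve_triangular(R, y)                          # x = inv(R')*y
    print(np.linalg.cond(A), np.linalg.cond(ARinv))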
RT also uses an iterative approach, R2, forming A' = Q*R and using inv(R) as a preconditioner:

Thm 5 (RT/R2): Choosing s as in Thm 4 guarantees kappa(A*inv(R)) <= 3, so an iterative algorithm converges in a bounded number of iterations.

Unfortunately, Thm 4 requires s = Theta(n^2), which is very large, again leading to a total cost of O(m*n*log(n) + n^4), but for a possibly much more accurate answer. RT's numerical experiments show that s = 4*n, which is much cheaper, is enough in practice.

(5) (MSM / R2) This paper presents the LSRN algorithm. Here S is chosen with i.i.d. N(0,1) random variables, with s ~ 2*n, and the preconditioner is gotten by taking the SVD of S*A = A' = U * Sigma * V^T, writing A' = U*R with R = Sigma*V^T, and using inv(R) as a preconditioner. The paper addresses the rank-deficient case too; for simplicity we consider the full-rank case:

Thm 6 (MSM): The reciprocal singular values of A*inv(R) have the same probability distribution as the singular values of an s x n i.i.d. N(0,1) matrix.

This means that the probability of a very large or small singular value falls off exponentially fast, so convergence is fast w.h.p. Since S is a dense matrix, forming A' = S*A costs 2*s*nnz(A) flops = 4*m*n^2 in the case of dense A, so more than the usual direct algorithm in the dense case, but possibly much less in the sparse case.

(6) (AMT / R2) This paper presents the Blendenpik algorithm. Here S is chosen very simply as S = F*D, where F consists of s uniformly sampled rows of a fixed m x m orthogonal matrix F', and D is diagonal with D(i,i) = +-1, i.i.d. with probability .5 each. A simple property of F' guarantees that F'*D*A will have low coherence w.h.p.:

Thm 7 (AMT): Let F' be any m x m orthogonal matrix, and D = diag(+-1) where each entry is i.i.d. with probability .5. Let f = max_{ij} |F'(i,j)|^2. Then with probability at least .95, mu(F'*D*A) <= C*n*f*log(m) (here C depends on the choice .95).

Since 1/m <= f <= 1, mu(F'*D*A) can potentially be as small as O(n/m * log(m)), near the lower bound n/m, which means, by Thm 3, that we only need to sample O(n*polylog(m)) rows of F'*D*A to get A' = S*A such that inv(R') (where A' = Q'*R') is a good preconditioner with high probability. There are a number of good choices of F', including the FFT, DCT, Walsh-Hadamard transform, etc., that minimize f and can be applied to a column of A in O(m*log(m)) flops. Thus the overall cost of the algorithm for a dense matrix is O(m*n*log(m) + n^3). In the paper, AMT show speedups of up to 4.5x over the dense QR algorithm.

(7) (CW / R1) Finally we discuss the Clarkson-Woodruff algorithm. Let eps < 1 be an error tolerance and s = O((n/eps)^2*log^6(n/eps)). Let S = F*D, where F is s x m and each column of F is a uniformly randomly selected column of eye(s), and D is m x m diagonal with D(i,i) = +-1 i.i.d. with probability .5 each. Thus each column of S has exactly one nonzero, +-1, and forming A' = S*A costs just O(nnz(A)). To repeat the analysis stated above:

Thm 1 (CW): With probability at least 2/3, norm(A*xhat-b,2) is at most (1+eps) times the true minimum norm(A*x-b,2), with cost O(nnz(A) + "lower order terms"), where the "lower order terms" grow like O(n^3*log^6(n/eps)/eps^2).
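To make this concrete, here is a minimal dense-numpy sketch of forming A' = S*A with this kind of S (each row of A is added, with a random sign, into a uniformly random row of A') and solving the small problem; s here is far smaller than the theoretical bound, and a sparse implementation would of course never densify anything:

    import numpy as np

    rng = np.random.default_rng(5)
    m, n, s = 100000, 50, 4000
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    h = rng.integers(0, s, size=m)           # h(i): which column of eye(s) column i of F picks
    sign = rng.choice([-1.0, 1.0], size=m)   # the diagonal D(i,i) = +-1

    Ap = np.zeros((s, n))
    bp = np.zeros(s)
    np.add.at(Ap, h, sign[:, None] * A)      # A' = S*A: one pass over the rows of A, O(nnz(A)) work
    np.add.at(bp, h, sign * b)               # b' = S*b

    x_sk = np.linalg.lstsq(Ap, bp, rcond=None)[0]
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    print(np.linalg.norm(A @ x_sk - b) / np.linalg.norm(A @ x - b))   # typically <= 1 + eps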