Notes for Randomized Numerical Linear Algebra, 24 Feb 2015
Starting today, we will videocapture each lecture, broadcast them live, and
archive them on youtube. Just for today, the link will be
https://www.youtube.com/watch?v=2w9y5YAOfEE
We will post instructions for future lectures on the website.
In addition to material on the class website
https://who.rocq.inria.fr/Laura.Grigori/RandNLA.html
we borrow from material presented in Ma221 in Fall 2014, at
http://www.cs.berkeley.edu/~demmel/ma221_Fall14/Lectures/Lecture_19.txt
Papers on recent communication avoiding algorithms like TSQR may be found at
bebop.cs.berkeley.edu
Our goals in this introductory lecture are
(1) to discuss the metrics for performance, accuracy and reliability that we want
to use to compare randomized and nonrandomized algorithms (both direct and
iterative), which we hope will inform discussions later in the semester too;
(2) recall different kinds of Least Squares (LS) problems, and for the simplest case
of full-rank, overdetermined problems, the usual nonrandomized algorithms
(direct: NE = Normal Equations, QR and SVD, and iterative: LSQR) and what
their metrics are;
(3) discuss randomized LS solvers, including Blendenpik, Clarkson-Woodruff, ...
Metrics:
(1) #Flops: This may just depend on the matrix dimensions, or also sparsity structure.
For dense, square n x n matrices, direct linear algebra algorithms typically cost
O(n^3). Given a fast matmul algorithm costing O(n^w), there are "numerically
stable" O(n^(w+ eps)) algorithms, for any eps>0; see "Fast Linear Algebra is
Stable" by Demmel/Dumitriu/Holtz for details. In at least one case, the
nonsymmetric eigenproblem, the only algorithm attaining this uses randomization.
For m x n matrices with m>n, for example the overdetermined LS problem, the
complexities are O(m*n^2) and O(m*n^(w - 1 + eps)). We will mostly consider
O(n^3) or O(m*n^2) algorithms, which are often (but not always) more practical.
For sparse matrices, with nnz(A) = number of nonzeros in A << m*n, we seek
both direct and iterative algorithms that do much less arithmetic.
Suppose A has n(i) nonzeros in row i, so nnz(A) = sum_i n(i). Looking ahead,
we will have nonrandomized direct algorithms for LS with m >> n that cost
O(sum_i n(i)^2) flops (plus ``lower order terms''), which will be a lower bound
under certain assumptions, in contrast to a randomized algorithm that costs
O(sum_i n(i)) = O(nnz(A)) flops (Clarkson-Woodruff), or O(m*n) in the dense case.
Iterative methods like LSQR are also used, where the inner loop is typically
matrix-vector multiplication and so does 2*nnz(A) flops. The number of iterations
depends on numerical properties of the matrix, like the distribution of its
singular values, often summarized by the condition number
kappa(A) = sigma_max(A)/sigma_min(A). Looking ahead, we will describe a
nonrandomized iterative method that does O(kappa(A)) iterations, and a
randomized one that does O(1) iteration w.h.p. (Blendenpik), and so again costs
O(m*n) in the dense case.
(2) Communication cost: Communication means moving data, either between levels of
memory hierarchy or between parallel processors over a network, and costs
orders of magnitude more per operation than a flop. #Flops is a good metric
for small problems that fit in cache, but the motivation for randomized
algorithms are very large problems where this is not the case.
So our performance model for sending one "message" of n contiguous words will be
alpha + n*beta = latency + n/bandwidth
where a "message" could be a cache line in a shared memory system, an actual
message in a distributed memory system (like Hopper and Edison), or a disk access.
Our overall performance model (along critical path of an algorithm) is then
gamma*#Flops + beta*#Words_sent + alpha*#Messages
= gamma*F + beta*W + alpha*S
where gamma << beta << alpha, with the gaps growing over time following
technological trends. We can use the same model for energy, where gamma etc
have units of joules instead of seconds, since energy usage is becoming
at least as important metric as run-time (and they are obviously correlated).
There has been a lot of recent work on communication lower bounds, starting with
linear algebra, but extending to code that can be described abstractly as
“for all k-tuples of integers in some some subset of Z^k, execute some code that
accesses arrays whose subscripts are affine functions of those integers.” This
class of algorithms clearly includes matrix multiplication C = A*B, with 3-tuples
of integers (i,j,k) accessing arrays A(i,k) B(k,j) and C(i,j), but most other
linear algebra algorithms (LU, QR, …), for dense or sparse matrices, and either
sequential or parallel implementations. And it goes beyond classical linear
algebra, including fast linear algebra (eg Strassen's method), some graph
algorithms, direct n-body algorithms, tensor contractions, etc.
For this class, we will just state the lower bound for (most) classical linear
algebra, and use it as a goal against which to measure our algorithms. Since
randomized algorithm often use classical linear algebra as building blocks, it
will apply at least to components of them.
To state the result, consider a 2-level memory hierarchy, called fast and slow,
where fast could equal cache and slow = DRAM, or fast = DRAM and slow = disk,
or fast = local memory of one parallel processor, and slow = memory of other
parallel processors connected by a network. We assume that arithmetic can
only be done on data in the fast memory. We want to lower bound
W = number of words moved between fast and slow memory
= Omega(max( #iterations / M^{1/2}, #inputs, #outputs )
where the number of iterations refers to the inner loop of the algorithm,
say a multiply-add for matrix multiply or classical linear algebra more generally.
In the case of a overdetermined least squares with a dense m x n input matrix,
running sequentially, #iterations = Theta(m n^2), so this bound becomes
W = Omega ( max( m*n^2 / M^{1/2} , m*n ) )
= Omega ( m*n ) when n <= M^{1/2}
In other words, the lower bound is the size of the input, raising the question
of whether we can really just access the data once, even when it is arbitrarily
large, provided n <= M^{1/2}, or n^2 <= M, i.e. we can fit a square
submatrix in memory.
Def: A "streaming algorithm" can solve an arbitrarily large problem that is
read once into fast memory (say a row at a time).
Note that a streaming algorithm does not even require the data to exist entirely
at one time, eg sensors could be producing data in real time that is streamed
into a computer and then thrown away. If we instead assume that the data is in
fact stored in slow memory, and so can be accessed more than once, we can ask for
a slightly weaker goal:
Def: A "weakly streaming algorithm" can solve a problem by reading it from
slow to fast memory O(1) times.
TSQR is a recent deterministic streaming algorithm. It targets dense matrices,
but could be modified to take (some) advantage of sparsity.
We will later describe randomized (weakly) streaming LS algorithms.
Regarding the latency cost, given a lower bound on W, a lower bound on the
number of messages is simply given by S >= W/largest_message_size. The
largest message size may be determined by characteristics of the HW of SW,
but is obviously bounded by M, the fast memory size to which the message is
delivered. Thus
S = Omega(max( #iterations / M^{3/2}, #inputs/M, #outputs/M )
Lower bounds on both W and S can be attained by deterministic algorithms,
TBD for randomized ones.
In the parallel case, with p processors, where fast memory is the local memory of
one processor and slow memory is the memory of other processors, we may consider a
load balanced algorithm so the lower bound becomes
W = number of words moved between one processor and other processors
= Omega( (#iterations/p) / M^{1/2} ).
The smallest we can choose M, for the dense m x n case, is M = m n / p,
i.e. there is just enough memory to store the problem, leading to the lower bound
W = Omega ( (m n^2 / p)/(m n / p)^{1/2)
= Omega ( m^{1/2} n^{3/2} / p^{1/2} )
which is attainable by recent deterministic algorithms. The latency lower bound
is again a factor of M smaller:
S = Omega ( (n/m)^{1/2} * p^{1/2} )
which is also attainable when m=n.
But the lower bound on W remains true if M exceeds its lower bound m*n/p.
For example, if we have enough memory for c>1 copies of the data, so M = c*m*n/p,
then the lower bound on W decreases by a factor of c^{1/2}, and the lower bound
on S decreases by c^{3/2}. This is attainable by matmul, at least when m=n,
up to c = p^{1/3}, and for W by QR and LU. On the other hand, we do not expect S
to decrease in all cases, because (at least for LU, and we conjecture for QR and
SVD, but not NE), there is an independent lower bound W*S = Omega(n^2), so if
W decreases, S must increase.
We note that for "Strassen-like algorithms" (which requires some technical
assumptions), the M^{1/2} in the above lower bounds is replaced by M^{w/2 - 1},
where w is the exponent of matrix multiplication, and is attainable for
various nonrandomized algorithms.
Finally, for iterative methods, the communication costs are typically dominated
by the number of iterations, assuming the matrix is too big to fit in fast memory.
(3) Accuracy: For LS, we can ask either about the error in the solution x or the
residual r = A*x-b. The usual nonrandomized goal is called backward stability,
i.e. getting the exact solution for a slightly different input, since the input
usually has uncertainty, so we ask for an answer "as accurate as the data
deserves". So we compare the solutions of the two LS problems
x = argmin || A*x-b || and xhat = argmin || (A+E)*xhat - (b+f) ||
where all norms are 2-norms, and E and f are "small" perturbations of A and b,
resp., measured by eps = norm(E)/norm(A) + norm(f)/norm(b). To first order:
|| x - xhat || / || x || <= eps * ( kappa(A)/cos(theta) + kappa^2(A)*tan(theta) )
where kappa(A) is the condition number and sin(theta) = norm(r)/norm(b),
i.e. theta is the angle between b and it best approximation A*x. So this error
bound can be very large in two situations:
(1) A is ill-conditioned, i.e. kappa(A) is large, and
(2) theta is close to pi/2, i.e. b is nearly orthogonal to the columns of A
The residuals r = A*x-b and rhat = (A+E)*xhat - (b+f) can differ by
|| r - rhat || / || b || <= (1 + 2*kappa(A)) * eps
so the residual is less sensitive.
Deterministic algorithms based on QR and SVD are backward stable and so attain
these bounds with eps = O(machine epsilon) = O(10^(-16)) in double precision.
NE is not backward stable, and only guarantees
|| x - xhat || / || x || <= O(machine epsilon) * kappa^2(A)
even when tan(theta) is small.
Blendenpik is "about as stable as QR" in the words of the authors.
Finally, in some applications we might only want the weaker criterion that
|| A*xhat - b || <= (1 + eps) * || A*x-b ||
where xhat is the computed solution and A*xhat-b its residual, and A*x-b is the
true residual, i.e. the fit is about as good in norm. Clarkson-Woodruff makes
such a guarantee:
Thm 1: With probability at least 2/3, norm(A*x-b,2) is at most (1+eps) times
larger than the true minimum, where the "lower order term" cost of the algorithm
grows like O(n^3*log^6(n/eps)/eps^2).
Various LS problems:
We will only discuss the first one, randomized algorithms for many of the others
remain to be explored, afaik:
Full rank, overdetermined: argmin || A*x - b ||
"Damped" overdetermined: argmin || [A; lambda*I] *x - [b;0] ||
aka ridge regression, or Tikhonov regularization with lambda*B instead
Low rank, overdetermined: choose x minimizing both || A*x - b|| and || x ||
Weighted: argmin || W*(A*x - b) ||
Constrained: choose x minimizing || A*x - b || such that B*x = c, or
|| x || <= tau, or || B*x - c || <= tau
Underdetermined versions ...
Randomized LS solvers:
Blendenpik (Avron/Maymounkov/Toledo = AMT, building on earlier work of
Rokhlin/Tygert = RT, and Drineas/Mahoney/Muthukrishnan/Sarlos = DMMS)
Idea: Normal equations are symmetric positive definite (s.p.d.) linear system
(A^T*A)*x = A^T*b. If we form A^T*A, we could solve directly with Cholesky
or iteratively using Conjugate Gradients, which would mean we would repeatedly
multiply vectors by A^T*A. But the cost of forming A^T*A typically dominates latter
costs when m >> n, so we can instead multiply (A^T*A)*x by two matrix-vector
multiplies A^T*(A*x). Reorganizing this to be more numerically stable yields the
LSQR algorithm of Paige/Saunders (1982/1973). The convergence rate is the same
as CG, so taking O(kappa(A)) steps to converge to a given residual. This can be
improved by preconditioning, i.e. solving argmin || A*M*y - b || for y, and then
taking x = M*y. This will be faster if (1) kappa(A*M) << kappa(A), and
(2) multiplying by M and M^T is cheap. A perfect choice would be M = inv(R) from
the QR factorization A = Q*R, since (1) A*M would be orthogonal with kappa(A*M)=1,
and (2) multiplying by inv(R) or inv(R^T) is a triangular solve (cheap).
But computing R would be as expensive as a deterministic method. So instead we will
use a randomized sampling of A to get an "approximation" Ahat, and then use the R
factor of Ahat.
Approaches to sampling A (from simple to sophisticated)
(1) (DMMS): Choose d rows of A and b randomly with equal probability, call d x n
matrix Ahat, and similarly bhat, and solve xhat = argmin || Ahat*xhat - bhat ||.
Then if A is "random" and d = O(n log(m) log(n log(m))), then
|| A*xhat - b || <= (1 + eps) || A*x-b || with "high probability".
Of course can't assume A is random in general.
(2) (DMM): Suppose some column of A has just one nonzero entry, say A(i,j). Then
unless we are lucky enough to sample row i, Ahat will have a zero column, and
so be singular, making xhat ill-defined. So we clearly need to assign a higher
sampling probability to row i.
Since A(i,j) is the only nonzero in column j, then in A=Q*R, Q(i,j+1:end) =0,
since later columns of Q are all orthogonal to column j of A. Thus, recalling
that Q is m x n, || Q(i,:) || = 1. This motivates the definition
Def: the coherence of A is mu(A) = max_i || Q(i,:) ||^2.
The coherence satisfies n/m <= mu(A) <= 1, the upper bound because Q is
orthogonal, the lower bound because sum_i || Q(i,:) ||^2 = n.
If mu(A) is near 1, i.e. the matrix is coherent, then some rows are more
"important" than others, whereas if mu(A) is near n/m, i.e. A is incoherent,
then rows are more nearly equally important.
|| Q(i,:) || measures how important row i is to sample.
Let p_i = || Q(i,:) ||^2 / n be the sampling probability for row i.
Thm 2: Let Ahat be obtained by sampling (with replacement) d rows, where row i
is chosen with probability p_i = ||Q(i,:)||^2 /n. Then (under some
additional conditions on b), we get
|| A*xhat - b || <= (1+eps) * ||A*x-b|| with probability 1-delta
provided we sample at least this many rows:
d = Omega( n^2*ln(3/delta) / eps^4 )
Note that the small LS problem, argmin || Ahat*xhat - bhat || ,
would cost O(dn^2) = O(n^3 * ln(3/delta) / eps^4) if done using
a dense deterministic problem, which is potentially much cheaper
than a deterministic O(mn^2) algorithm on the original problem.
On the other hand, computing the p_i in the first place would cost
as much as a QR decomposition, i.e. O(mn^2).
There is an analogous result to the above theorem (AMT), which says that
for inv(Rhat) from Ahat = Qhat*Rhat to be a good preconditioner for LSQR,
the number of sampled rows d has to grow at least as fast as m*mu(A), so
O(m) for a coherent matrix down to O(n) for an incoherent matrix:
Thm 3 (AMT): Let Ahat be d randomly sampled rows of A, with uniform probability.
Let tau = C * sqrt(m * mu(A) * log(d) / d ), and tau/delta < 1. Then with
probability at least 1 - delta,
kappa(A*inv(Rhat)) <= (1 + tau/delta)/(1 - tau/delta)
(3) (AMT) A way to get around coherence is to get Ahat from "randomly mixing"
rows of A, i.e. Ahat = F*A where F is a d x m random matrix. A good
choice of F will mean (1) A and F*A have similar singular values, so
R from A = Q*R and Rhat from Ahat = F*A = Qhat*Rhat will be similarly
good preconditioners for LSQR, and (2) F is cheap to multiply by.
For example, F could be a random orthonormal matrix with Haar distribution,
i.e. "uniformly distributed" over all orthonormal matrices. This may be
computed by starting with an m x d matrix B with i.i.d. N(0,1) entries,
computing B = Q_B * R_B and letting F = Q_B^T. But this is clearly more
expensive than solving the original problem, if d > n.
Fortunately, the following theorem shows that a much wider set of F matrices
works, including very cheap ones to multiply by:
Thm 4 (AMT) Let F' be any m x m orthogonal matrix, and D = diag(+-1) where
each entry is chosen by flipping a fair coin (Rademacher random variables).
Let f = max_{ij} |F'(i,j)|^2. Then with probability at least .95,
mu(F'*D*A) <= C*n*f*log(m)
(here C depends on the choice .95).
Since 1/m <= f <= 1, mu(F'*D*A) can potentially be as small as O(n/m * log(m)),
near the lower bound, which means, by Thm 3, that we only need to sample
O(n*polylog(m)) rows of F'*D*A to get an Ahat such that Rhat is a good
preconditioner with high probability.
There are a number of good choices of F' including FFT, DCT, Walsh-Hadamard,
etc. that minimize f and can be applied to a column of A in m*log(m) ops.
This leads us to state the Blendenpik algorithm:
choose sampling parameter gamma = O(1)
form M = F'*D*A
choose each row of M with probability gamma*n/m to get Ahat
... so the expected number of rows chosen is gamma*n
Ahat = Qhat*Rhat
run LSQR preconditioned with inv(Rhat) to solve argmin || A*x-b ||
In AMT, compared to the existing LAPACK implementation, Blenderpik with up to 4.5x
faster on 120000 x 3000 matrices. This was using the implementation of QR in LAPACK
before TSQR, so it would be interesting to compare Blendenpik to TSQR. Depending
on the relative costs of arithmetic and communication, one or the other may be
faster. We note that Blendenpik is a "weakling streaming" algorithm by our
previous definition.