Notes for Ma221 Lecture 10, Mar 29, 2022

Chapter 5 on symmetric eigenvalue problems and the SVD

Goals: Perturbation Theory
       Algorithms for the symmetric eigenproblem and SVD

Perturbation Theory

The theorems we present are useful not just for perturbation theory, but also
for understanding why algorithms work. Everything from chapter 4 applies to
symmetric matrices, but much more can be shown.

We consider only real, symmetric A = A^T. The case of complex Hermitian
(A = A^H) is similar. Then we know that we can write A = Q*Lambda*Q^T where
Lambda = diag(lambda_1, ..., lambda_n) with real eigenvalues, and
Q = [q_1,...,q_n] consists of real orthonormal eigenvectors.
We'll assume lambda_1 >= ... >= lambda_n.
Note that the complex symmetric case is totally different:
A = [[1, i];[i, -1]] has 2 eigenvalues at 0, and is not diagonalizable.

Most of what we say will also be true for the SVD of an arbitrary matrix:
Recall that the SVD of A is simply related to the eigendecomposition of the
symmetric matrix (see Thm 3.3 part 4)
   B = [ 0    A ]
       [ A^T  0 ]
with eigenvalues of B = +- singular values of A (plus some zeros if A is
rectangular). So a small change from A to A+E is also a small symmetric change
to B. So if we show this doesn't change the eigenvalues of B very much (in fact
by no more than norm(E)), then the singular values of A can change just as little.

Def: Rayleigh quotient rho(u,A) = u^T*A*u / u^T*u.
Properties:
  If A*q = lambda*q, then rho(q,A) = lambda.
  If u = sum_i b_i*q_i = Q*b where b = [b_1;...;b_n], then
     rho(u,A) = (Q*b)^T*A*(Q*b) / ((Q*b)^T*(Q*b))
              = b^T*Q^T*A*Q*b / (b^T*Q^T*Q*b)
              = b^T*Lambda*b / (b^T*b)
              = sum_i lambda_i*b_i^2 / sum_i b_i^2
              = convex combination of {lambda_1,...,lambda_n}
  so lambda_1 >= rho(u,A) >= lambda_n for all u, and
     lambda_1 = max_u rho(u,A)
     lambda_n = min_u rho(u,A)

In fact all the eigenvalues can be written using the Rayleigh quotient:

Thm (Courant-Fischer Minimax Thm): Let R^j denote a j-dimensional subspace of
n-dimensional space, and S^(n-j+1) denote an (n-j+1)-dimensional subspace. Then
   max_{R^j} min_{0 neq r in R^j} rho(r,A)
     = lambda_j
     = min_{S^(n-j+1)} max_{0 neq s in S^(n-j+1)} rho(s,A)
The max over R^j is attained when R^j = span(q_1,...,q_j), and the min over r
is attained for r = q_j. The min over S^(n-j+1) is attained for
S^(n-j+1) = span(q_j,q_(j+1),...,q_n), and the max over s is attained for
s = q_j too.

Proof: For any pair R^j and S^(n-j+1), since their dimensions add up to n+1,
they must intersect in some nonzero vector x_RS. So
   min_{0 neq r in R^j} rho(r,A) <= rho(x_RS,A) <= max_{0 neq s in S^(n-j+1)} rho(s,A)
Let R' maximize min_{0 neq r in R^j} rho(r,A), and
let S' minimize max_{0 neq s in S^(n-j+1)} rho(s,A), so
   max_{R^j} min_{0 neq r in R^j} rho(r,A)
     = min_{0 neq r in R'} rho(r,A)
    <= rho(x_R'S',A)
    <= max_{0 neq s in S'} rho(s,A)
     = min_{S^(n-j+1)} max_{0 neq s in S^(n-j+1)} rho(s,A)
But if we choose R^j = span(q_1,...,q_j) and r = q_j, we get
   min_{0 neq r in R^j} rho(r,A) = rho(q_j,A) = lambda_j
and if we choose S^(n-j+1) = span(q_j,...,q_n) and s = q_j, we get
   max_{0 neq s in S^(n-j+1)} rho(s,A) = rho(q_j,A) = lambda_j
so the above chain of inequalities is bounded below and above by lambda_j,
and so must equal lambda_j throughout.

(We pause the recorded lecture here.)

Weyl's Theorem: If A is symmetric, with eigenvalues lambda_1 >= ... >= lambda_n,
and A+E is symmetric, with eigenvalues mu_1 >= ... >= mu_n, then
   | lambda_j - mu_j | <= ||E||_2 for all j.

Corollary: If A is a general matrix with singular values sigma_1 >= ... >= sigma_n,
and A+E has singular values tau_1 >= ... >= tau_n, then
   | sigma_j - tau_j | <= ||E||_2 for all j.
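Before proving Weyl's Theorem, here is a quick numerical sanity check in Matlab
(a sketch, not part of the original notes): perturb a random symmetric matrix
and compare the eigenvalue changes to ||E||_2.

   % Sketch: check Weyl's bound on a random symmetric matrix.
   n = 6; A = randn(n); A = (A + A')/2;        % random symmetric A
   E = 1e-3*randn(n); E = (E + E')/2;          % small symmetric perturbation
   lambda = sort(eig(A),'descend'); mu = sort(eig(A+E),'descend');
   [ max(abs(lambda - mu)), norm(E) ]          % first number should not exceed the second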
Proof of Weyl's Theorem:
   mu_j = min_{S^(n-j+1)} max_{0 neq s in S^(n-j+1)} rho(s,A+E)
        = min_{S^(n-j+1)} max_{0 neq s in S^(n-j+1)} s^T*(A+E)*s / s^T*s
        = min_{S^(n-j+1)} max_{0 neq s in S^(n-j+1)} [ s^T*A*s/s^T*s + s^T*E*s/s^T*s ]
       <= min_{S^(n-j+1)} max_{0 neq s in S^(n-j+1)} [ s^T*A*s/s^T*s + ||E||_2 ]
        = lambda_j + ||E||_2
Swapping the roles of mu_j and lambda_j, we get lambda_j <= mu_j + ||E||_2,
which together give what we want.

Def: Inertia(A) = (#negative evals(A), #zero evals(A), #positive evals(A))

Sylvester's Theorem: If A is symmetric and X nonsingular, then
Inertia(A) = Inertia(X^T*A*X).

Fact: Suppose we do A = L*D*L^T (Gaussian elimination, with symmetric pivoting
or no pivoting). Then A and D have the same Inertia, namely
(#D(i,i)<0, #D(i,i)=0, #D(i,i)>0), which is easy to compute.
Now suppose we do A - x*I = L*D*L^T; then
   #D(i,i)<0 = #eigenvalues of A-x*I < 0 = #eigenvalues of A < x
Similarly, suppose we do A - y*I = L'*D'*L'^T for some y > x; then we can
similarly compute #eigenvalues of A <= y. Then
   #eigenvalues in [x,y] = (#evals <= y) - (#evals < x)
so we can count the number of eigenvalues in any interval [x,y]. Now suppose we
also count #evals < (x+y)/2; then we get the number of evals in each half of the
interval. By repeatedly bisecting the interval, we can compute any subset of the
eigenvalues as accurately as we like. We'll say more on how to do this cheaply later.

Proof of Sylvester's Theorem: Suppose #evals of A < 0 is m, and
#evals of X^T*A*X < 0 is m', but m' < m; let's find a contradiction.
Let N be the m-dimensional subspace spanned by the eigenvectors of the m
negative eigenvalues of A, so x in N means x^T*A*x < 0. Let P be the
(n-m')-dimensional subspace spanned by the eigenvectors of the nonnegative
eigenvalues of X^T*A*X, so x in P means
x^T*X^T*A*X*x = (X*x)^T*A*(X*x) >= 0, i.e. y in X*P means y^T*A*y >= 0.
But dimension(X*P) = dimension(P) = n-m', and
dimension(N) + dimension(X*P) = n-m'+m > n, so they intersect, i.e. there is
some nonzero x in both N and X*P, i.e. x^T*A*x < 0 and x^T*A*x >= 0,
a contradiction.

(We pause the recorded lecture here.)

Now a little about eigenvectors:

Thm: Let A = Q*Lambda*Q^T be the eigendecomposition of A, with
Lambda = diag(lambda_i) and Q = [q_1,...,q_n], and let
A+E = Q'*Lambda'*Q'^T be the eigendecomposition of A+E, with
Lambda' = diag(lambda'_i) and Q' = [q'_1,...,q'_n].
Let theta_i = angle between q_i and q'_i; we want to bound theta_i.
Let gap(i,A) = min_{j neq i} |lambda_j - lambda_i|. Then
   | .5*sin(2*theta_i) | <= ||E||_2 / gap(i,A)
Note that when theta_i is small, .5*sin(2*theta_i) ~ theta_i.
In other words, if the perturbation ||E||_2 is small compared to the distance
gap(i,A) to the nearest other eigenvalue, the change in direction of the
eigenvector will be small.

Proof of a weaker result: Write eigenvector i of A+E as q_i + d where d is
perpendicular to q_i (so q'_i = (q_i+d) / ||q_i+d||_2). Then
   (A+E)*(q_i+d) = lambda'_i*(q_i+d)
Ignoring second order terms (the term E*d), we get
   A*q_i + E*q_i + A*d = lambda'_i*q_i + lambda'_i*d
or (A + E - lambda'_i*I)*q_i = (lambda'_i*I - A)*d
or (lambda_i*I + E - lambda'_i*I)*q_i = (lambda'_i*I - A)*d
Write d = sum_{j neq i} d_j*q_j, yielding
   (lambda_i*I + E - lambda'_i*I)*q_i = sum_{j neq i} d_j*(lambda'_i - lambda_j)*q_j
Taking 2-norms, norm(LHS) <= |lambda_i - lambda'_i| + ||E||_2 <= 2*||E||_2
by Weyl's Theorem.
Taking 2-norms, norm(RHS) >= (gap(i,A) - ||E||_2)*||d||_2, since
   |lambda'_i - lambda_j| >= |lambda_i - lambda_j| - |lambda'_i - lambda_i|
                          >= gap(i,A) - ||E||_2
So ||d||_2 = |tan(theta_i)| <= 2*||E||_2 / (gap(i,A) - ||E||_2).
See the text for the full proof.
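As a numerical illustration of this bound (a sketch, not from the notes; it uses
Matlab's subspace function to measure the angle between the two eigenvectors):

   % Sketch: compare the observed eigenvector rotation to the bound ||E||_2/gap.
   n = 6; A = randn(n); A = (A + A')/2;
   E = 1e-6*randn(n); E = (E + E')/2;
   [Q,L]   = eig(A);   [lam,p] = sort(diag(L),'descend');  Q  = Q(:,p);
   [Qp,Lp] = eig(A+E); [~,pp]  = sort(diag(Lp),'descend'); Qp = Qp(:,pp);
   gap = min(abs(lam(2:n) - lam(1)));          % gap(1,A)
   theta = subspace(Q(:,1), Qp(:,1));          % angle between q_1 and q'_1
   [ abs(0.5*sin(2*theta)), norm(E)/gap ]      % left side should not exceed the right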
(We pause the recorded lecture here.)

Important fact about the Rayleigh quotient: it is a best approximation to an
eigenvalue in a certain sense:

Thm: Suppose x is a unit vector and beta a scalar. Then A has an eigenvalue
alpha such that |alpha - beta| <= ||A*x - beta*x||_2. Given only x, the choice
beta = rho(x,A) minimizes ||A*x - beta*x||_2.

So given any unit vector x, there is always an eigenvalue within distance
||A*x - rho(x,A)*x||_2 of rho(x,A), and rho(x,A) minimizes this bound.
Furthermore, let lambda_i be the eigenvalue of A closest to rho(x,A), and let
gap = min_{j neq i} |lambda_j - rho(x,A)| be the distance to the next closest
eigenvalue. Then
   | lambda_i - rho(x,A) | <= || A*x - rho(x,A)*x ||_2^2 / gap
i.e. the error in rho(x,A) as an approximate eigenvalue is proportional to the
square of the norm of the residual A*x - rho(x,A)*x.

Proof of part 1: If A - beta*I is singular, then beta is an eigenvalue and we
are done. Otherwise, since x is a unit vector,
   1 = ||x||_2 = || inv(A - beta*I)*(A - beta*I)*x ||_2
so 1 <= || inv(A - beta*I) ||_2 * || A*x - beta*x ||_2
or 1 <= ( 1 / min_i |lambda_i - beta| ) * || A*x - beta*x ||_2
i.e. min_i |lambda_i - beta| <= || A*x - beta*x ||_2, as desired.
To show that beta = rho(x,A) minimizes || A*x - beta*x ||_2, write
   A*x - beta*x = [ A*x - rho(x,A)*x ] + [ rho(x,A)*x - beta*x ] = y + z
If we show z is orthogonal to y, then by the Pythagorean Theorem we are done:
   z^T*y = (rho(x,A) - beta) * x^T * (A*x - rho(x,A)*x)
         = (rho(x,A) - beta) * (x^T*A*x - rho(x,A)*x^T*x)
         = 0

Partial proof of part 2: We do just the special case of a 2x2 diagonal matrix,
which seems very special, but has all the ingredients of the general case.
So we assume A = diag(lambda_1,lambda_2), and x = [c;s] where c^2+s^2 = 1, so
rho = c^2*lambda_1 + s^2*lambda_2. We'll assume c^2 > s^2, so rho is closer to
lambda_1 than lambda_2, so
   |lambda_1 - rho| = |lambda_1 - c^2*lambda_1 - s^2*lambda_2| = s^2*|lambda_1 - lambda_2|
   gap = |lambda_2 - rho| = |lambda_2 - c^2*lambda_1 - s^2*lambda_2| = c^2*|lambda_1 - lambda_2|
   r = A*x - rho*x = [ lambda_1*c ] - (c^2*lambda_1 + s^2*lambda_2) * [ c ]
                     [ lambda_2*s ]                                   [ s ]
                   = [ c*s^2*(lambda_1 - lambda_2) ]
                     [ s*c^2*(lambda_2 - lambda_1) ]
and so ||r||_2^2 = s^2*c^2*(lambda_1-lambda_2)^2*(s^2+c^2) = s^2*c^2*(lambda_1-lambda_2)^2
and ||r||_2^2 / gap = s^2*|lambda_1 - lambda_2| = |lambda_1 - rho|
exactly (whereas the theorem only claims an upper bound, in the general case).
This result will be important later to help show that the QR algorithm from
Chapter 4 converges cubically when applied to symmetric matrices (instead of
"just" quadratically).

(We pause the recorded lecture here.)

Overview of Algorithms:

There are several approaches, depending on what one wants to compute:
(1) "Usual accuracy": backward stable in the sense that you get the exact
    eigenvalues and eigenvectors of A+E where ||E|| = O(macheps)*||A||
    (1.1) Get all the eigenvalues (with or without the corresponding eigenvectors)
    (1.2) Just get all the eigenvalues in some interval [x,y]
          (with or without the corresponding eigenvectors)
    (1.3) Just get lambda_i through lambda_j (with or without the corresponding
          eigenvectors). Ex: get the 10 largest eigenvalues (lambda_1 through lambda_10).
    (1.2) and (1.3) can be rather cheaper than (1.1) if only a few eigenvalues are wanted.
(2) "High accuracy": compute tiny eigenvalues and their eigenvectors more
    accurately than the "usual" accuracy guarantees.
    Ex: If A is well-conditioned, so all its singular values are large, then the
    error bound O(macheps*norm(A)) implies we can compute them with small
    relative errors. Now consider B = D*A where D is diagonal.
    If some entries of D are very small, and some are O(1), B will be
    ill-conditioned, and so have some tiny singular values, and the usual error
    bound O(macheps*norm(B)) implies they will not be computed accurately. But
    in this case there is a tighter perturbation theory, and an algorithm, that
    will still compute the tiny singular values of B with about as many correct
    digits as for A. For a survey of cases like this, when high relative
    accuracy in tiny eigenvalues and singular values is possible, see the papers
    "Accurate and efficient expression evaluation and linear algebra,"
    "New fast and accurate Jacobi SVD algorithm,"
    "Computing the SVD with high relative accuracy," and
    "Jacobi's method is more accurate than QR," linked on the class web page.
(3) Updating: given the eigenvalues and eigenvectors of A, find them for A with
    a "small change", small in rank, e.g. A +- x*x^T.

All the above options also apply to computing the SVD, with the additional
possibility of computing the left and/or the right singular vectors.

Algorithms and their costs:

(1) We begin by reducing A to Q^T*A*Q = T where Q is orthogonal and T is
tridiagonal, costing O(n^3) flops. This uses the same approach as in chapter 4,
where Q^T*A*Q was upper Hessenberg, since a matrix that is both symmetric and
upper Hessenberg must be tridiagonal. This is currently done by LAPACK routine
ssytrd. All subsequent algorithms operate on T, which is much cheaper.
We have recently identified an algorithm for tridiagonal reduction that moves
only O(n^3/sqrt(fast_memory_size)) words between fast and slow memory, attaining
the lower bound. See "Avoiding Communication in Successive Band Reduction,"
linked on the class webpage, for details.
When A is banded, a different algorithm (in ssbtrd) takes advantage of the band
structure to reduce to tridiagonal form in just O(n^2*bandwidth) operations.
This can also be done while doing much less communication (the lower bound
itself is an open question).

In the case of the SVD, we begin by reducing the general nonsymmetric (even
rectangular) A to U^T*A*V = B where U and V are orthogonal and B is bidiagonal,
i.e. nonzero on the main diagonal and right above it. All subsequent SVD
algorithms operate on B. This is currently done by LAPACK routine sgebrd.
All the ideas for minimizing communication mentioned above generalize to the SVD.
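In Matlab, hess applied to a symmetric matrix returns exactly this tridiagonal T
(as used in the QR demo later in these notes). A quick check that the reduction
preserves the spectrum (a sketch, not part of the original notes):

   % Sketch: tridiagonal reduction of a symmetric matrix via hess, plus a spectrum check.
   n = 8; A = randn(n); A = (A + A')/2;
   [Q,T] = hess(A);                      % A = Q*T*Q', T symmetric tridiagonal
   [ norm(Q'*A*Q - T), norm(sort(eig(T)) - sort(eig(A))) ]   % both roughly macheps*norm(A)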
(1.1) Given T, we need to find all its eigenvalues (and possibly eigenvectors).
There are a variety of algorithms; all cost O(n^2) just for the eigenvalues, but
anywhere from O(n^2) to O(n^3) for the eigenvectors. Also, some have better
numerical stability properties than others. We summarize their properties here,
and describe a few in more detail below.

(1.1.1) Oldest is QR iteration, as in chapter 4, but for tridiagonal T.
Thm (Wilkinson): With the right shift, tridiagonal QR is globally convergent,
and usually cubically convergent (the number of correct digits triples at each
step!). (We sketch a partial proof later.)
It costs O(n^2) to get all the eigenvalues, but O(n^3) to get the eigenvectors,
unlike later routines. Since it only multiplies by orthogonal matrices, it is
backward stable by the same analysis as in chapter 4. It is used by LAPACK
routine ssyev. This was the standard approach for many years because of its
reliability. LAPACK routine sgesvd uses a variant of QR iteration for the SVD,
which has the additional property of guaranteed high relative accuracy for all
singular values, no matter how small, if the input is bidiagonal (see "Accurate
singular values of bidiagonal matrices" on the class webpage).

(1.1.2) Another approach, which is much faster for computing eigenvectors
(O(n^2) instead of O(n^3)) but does not guarantee that they are orthogonal,
works as follows:
   (1) compute the eigenvalues alone (sstebz in LAPACK)
   (2) compute the eigenvectors using inverse iteration (sstein in LAPACK):
          x_(i+1) = inv(T - lambda(j)*I)*x_(i)
Since T is tridiagonal, one step of inverse iteration costs O(n), and since
lambda(j) is an accurate eigenvalue, it should converge in very few steps.
The trouble is that when lambda(j) and lambda(j+1) are very close, and so the
eigenvectors are very sensitive, they may not be computed to be orthogonal,
since there is nothing in the algorithm that explicitly enforces this (imagine
what would happen if lambda(j) and lambda(j+1) were so close that they rounded
to the same floating point number). Still, the possibility of an algorithm that
runs in O(n^2) time and guarantees orthogonality was a goal for many years,
eventually achieved by the algorithm discussed below (MRRR).

(1.1.3) Next is "divide-and-conquer", which is faster than QR, but not as fast
as inverse iteration. It is used in LAPACK routine ssyevd (sgesdd for the SVD).
Its speed is harder to analyze, but empirically it is like O(n^g) for some
2 < g < 3. The idea behind it is also used for the updating problem, i.e.
getting the eigenvalues and eigenvectors of A + x*x^T given those of A.

(1.1.4) Most recent is MRRR = Multiple Relatively Robust Representations, which
can be thought of as a version of inverse iteration that does guarantee
orthogonal eigenvectors, and still costs just O(n^2). It was developed by
Prof. Parlett and former student Inderjit Dhillon (see "Orthogonal Eigenvectors
and Relative Gaps" on the class webpage). Since the output eigenvector matrix is
of size n^2, this cost is optimal (but see below). It is implemented in LAPACK
routine ssyevr. The adaptation of this algorithm to the SVD appeared in the
prize-winning thesis of Paul Willems, but some examples of matrices remain where
it does not achieve the desired accuracy, so it is still an open problem to make
this routine reliable enough for general use.

In theory, there is an algorithm even faster than O(n^2), based on
divide-and-conquer, which represents the eigenvector matrix Z of T implicitly
rather than as n^2 explicit matrix entries. But since most users want explicit
eigenvectors, and the cost of the reduction to tridiagonal form T is already
much larger, we do not usually use it:
Thm (Ming Gu): One can compute Z in O(n log^p n) time, provided we represent it
implicitly (so that we can multiply any vector by it cheaply). Here p is a small
integer. Thus A = Q*T*Q^T = Q*Z*Lambda*Z^T*Q^T = (Q*Z)*Lambda*(Q*Z)^T, so we
still need to multiply Q*Z to get the final eigenvectors (which costs O(n^3)).

(1.2) or (1.3): Reduce A to Q^T*A*Q = T as above. Use bisection (based on
Sylvester's Thm) on T to get a few eigenvalues, and then inverse iteration to
get their eigenvectors if desired. This is cheap, O(n) per eigenvalue/eigenvector
of T, but does not guarantee orthogonality of the eigenvectors of nearby
eigenvalues (ssyevx in LAPACK). MRRR (in ssyevr) could be used for this too,
and guarantees orthogonality.

(2) Based on Jacobi's Method, the historically oldest algorithm. The modern
version is by Drmac and Veselic, implemented in sgesvj for the SVD (not yet for
the eigenproblem).

(3) Use the same idea as divide-and-conquer: assuming we have the eigenvalues
and eigenvectors of A = Q*Lambda*Q^T, it is possible to compute the eigenvalues
of A +- u*u^T in O(n^2), much faster than starting from scratch.
(We pause the recorded lecture here.)

We now illustrate important properties of some of these algorithms.

Matlab demo of QR, illustrating cubic convergence:
   n=6; A = randn(n,n); A = A+A'; T = hess(A);
   s = T(n,n);
   [Q,R] = qr(T - s*eye(n));
   T = R*Q + s*eye(n);
   off = triu(tril(T,-1),-1); T = off + off' + diag(diag(T))
   ... the last line just ensures that T stays symmetric and tridiagonal

Cubic convergence will follow from the analysis of a simpler algorithm called
Rayleigh quotient iteration:
   choose unit vector x_0
   i = 0
   repeat
      s_i = rho(x_i,A) = x_i^T*A*x_i
      y = (A - s_i*I)^(-1) * x_i
      x_(i+1) = y / ||y||_2
      i = i+1
   until convergence (s_i and/or x_i stop changing)
This is what we called inverse iteration before, using the Rayleigh quotient s_i
as a shift, which we showed is the best possible eigenvalue approximation for
the approximate eigenvector x_i. (A Matlab sketch of this iteration appears at
the end of this discussion.)

Suppose that A*q = lambda*q, ||q||_2 = 1, and ||x_i - q|| = e << 1. We want to
show that ||x_(i+1) - q|| = O(e^3). We need to bound |s_i - lambda|; a first
bound is
   s_i = x_i^T*A*x_i
       = (x_i - q + q)^T*A*(x_i - q + q)
       = (x_i-q)^T*A*(x_i-q) + q^T*A*(x_i-q) + (x_i-q)^T*A*q + q^T*A*q
       = (x_i-q)^T*A*(x_i-q) + lambda*q^T*(x_i-q) + lambda*(x_i-q)^T*q + lambda
so s_i - lambda = (x_i-q)^T*A*(x_i-q) + 2*lambda*(x_i-q)^T*q
so |s_i - lambda| = O( ||x_i-q||^2 + ||x_i-q|| ) = O( ||x_i-q|| ) = O(e)
A tighter bound is
   |s_i - lambda| <= || A*x_i - s_i*x_i ||^2 / gap
      ... where gap = distance from s_i to the next closest eigenvalue
      ... we assume the gap is not too small
      ... this is Thm 5.5 in the text, proof sketched above
                  = || A*(x_i - q + q) - s_i*(x_i - q + q) ||^2 / gap
                  = || (A - s_i*I)*(x_i-q) + (lambda - s_i)*q ||^2 / gap
                 <= ( || (A - s_i*I)*(x_i-q) || + || (lambda - s_i)*q || )^2 / gap
                  = O(e^2)
Now we take one step of inverse iteration to get x_(i+1); from earlier analysis
(Sec 4.4.2, discussed in the Chapter 4 lectures) we know
   || x_(i+1) - q || <= || x_i - q || * |s_i - lambda| / gap = e * O(e^2) / gap = O(e^3)
as desired.

To see that QR iteration is doing Rayleigh quotient iteration in disguise, look
at one step: T - s(i)*I = Q*R, so
   (T - s(i)*I)^(-1) = R^(-1)*Q^(-1)
                     = R^(-1)*Q^T        ... since Q is orthogonal
                     = (R^(-1)*Q^T)^T    ... since the matrix is symmetric
                     = Q*R^(-T)
so (T - s(i)*I)^(-1) * R^T = Q
so (T - s(i)*I)^(-1) * e_n * R(n,n) = q_n = last column of Q
Since s(i) = T(n,n) = e_n^T*T*e_n, s(i) is just the Rayleigh quotient for e_n,
and q_n is the result of one step of Rayleigh quotient iteration. Then the
updated T is R*Q + s(i)*I = Q^T*T*Q, so (updated T)(n,n) = q_n^T*T*q_n is the
next Rayleigh quotient, as desired.

In practice QR iteration is implemented analogously to chapter 4, by "bulge
chasing", at a cost of O(n) per iteration, or O(n^2) to find all the
eigenvalues. But multiplying together all the Givens rotations to accumulate the
eigenvectors still costs O(n^3), so something faster is still desirable.
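Here is the promised Matlab sketch of Rayleigh quotient iteration itself (not
from the notes), printing the residual at each step; once the iteration starts
converging, the number of correct digits roughly triples per step:

   % Sketch: Rayleigh quotient iteration; watch the residual shrink cubically.
   n = 6; A = randn(n); A = (A + A')/2;
   x = randn(n,1); x = x/norm(x);
   for i = 1:5
      s = x'*A*x;                     % Rayleigh quotient shift
      y = (A - s*eye(n)) \ x;         % shifted inverse iteration step
      x = y/norm(y);                  % (near convergence A - s*I is nearly singular,
                                      %  so Matlab may warn; that is expected here)
      fprintf('step %d: residual %8.2e\n', i, norm(A*x - (x'*A*x)*x))
   end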
(We pause the recorded lecture here.)

Divide-and-conquer:

The important part is how to cheaply compute the eigendecomposition of a
rank-one update to a symmetric matrix, A + alpha*u*u^T, where we already have
the eigendecomposition A = Q*Lambda*Q^T. Then
   A + alpha*u*u^T = Q*Lambda*Q^T + alpha*u*u^T
                   = Q*(Lambda + alpha*(Q^T*u)*(Q^T*u)^T)*Q^T
                   = Q*(Lambda + alpha*v*v^T)*Q^T     where v = Q^T*u
so we only need to think about computing the eigenvalues and eigenvectors of
Lambda + alpha*v*v^T. Let's compute its characteristic polynomial:

Lemma (homework!): det(I + x*y^T) = 1 + y^T*x

Then the characteristic polynomial is
   det(Lambda + alpha*v*v^T - lambda*I)
      = det( (Lambda - lambda*I)*(I + alpha*(Lambda - lambda*I)^(-1)*v*v^T) )
      = prod_i (lambda_i - lambda) * (1 + alpha*sum_i v(i)^2/(lambda_i - lambda))
      = prod_i (lambda_i - lambda) * f(lambda)
So our goal is to solve the so-called secular equation f(lambda) = 0.

Consider figure 5.2 in the text (Secular1.ps): this plot of
   f(lambda) = 1 + .5/(1-lambda) + .5/(2-lambda) + .5/(3-lambda) + .5/(4-lambda)
looks benign: we see there is one root of f(lambda) between each pair
[lambda_i, lambda_(i+1)] of adjacent eigenvalues, and in each such interval
f(lambda) is monotonic, so some Newton-like method should work well.
But now consider figure 5.3 in the text (Secular2.ps):
   f(lambda) = 1 + .001/(1-lambda) + .001/(2-lambda) + .001/(3-lambda) + .001/(4-lambda)
Here f(lambda) no longer looks so benign, and simple Newton would not work,
taking a big step out of the bounding interval. But we can still use a
Newton-like method: instead of approximating f(lambda) by a straight line at
each step, and getting the next guess by finding a zero of that straight line,
we approximate f(lambda) by a simple but non-linear function that better matches
the graph, with poles at lambda_i and lambda_(i+1):
   g(lambda) = c1 + c2/(lambda_i - lambda) + c3/(lambda_(i+1) - lambda)
where c1, c2 and c3 are chosen, as in Newton, to match f and f' at the last
approximate zero. Solving g(lambda) = 0 then means solving a quadratic for
lambda. (picture)

It remains to compute the eigenvectors:

Lemma: If lambda is an eigenvalue of Lambda + alpha*v*v^T, then
(Lambda - lambda*I)^(-1)*v is its eigenvector. Since Lambda is diagonal, this
costs O(n) to compute.
Proof:
   (Lambda + alpha*v*v^T)*(Lambda - lambda*I)^(-1)*v
      = (Lambda - lambda*I + lambda*I + alpha*v*v^T)*(Lambda - lambda*I)^(-1)*v
      = v + lambda*(Lambda - lambda*I)^(-1)*v + v*(alpha*v^T*(Lambda - lambda*I)^(-1)*v)
      = v + lambda*(Lambda - lambda*I)^(-1)*v - v
           ... since alpha*v^T*(Lambda - lambda*I)^(-1)*v + 1 = f(lambda) = 0
      = lambda*(Lambda - lambda*I)^(-1)*v
as desired.

Unfortunately this simple formula is not numerically stable; the text describes
the clever fix (due to Ming Gu and Stan Eisenstat, for which Prof. Gu won the
Householder Prize for the best thesis in linear algebra in 1996).

When two eigenvalues lambda_i and lambda_(i+1) are (nearly) identical, then we
know there is a root between them without doing any work. Similarly, when v(i)
is (nearly) zero, we know that lambda_i is (nearly) an eigenvalue without doing
any work. In practice a surprising number of eigenvalues can be found quickly
("deflated") this way, which speeds up the algorithm.

Write [Q',Lambda'] = Eig_update(Q,Lambda,alpha,u) for the function we just
described, which takes the eigendecomposition of A = Q*Lambda*Q^T and updates
it, returning the eigendecomposition of A + alpha*u*u^T = Q'*Lambda'*Q'^T.
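Here is a Matlab sketch of this rank-one update step (not the LAPACK
implementation, and not using the Gu/Eisenstat fix): it solves the secular
equation in each interval with fzero and checks the eigenvector formula from the
Lemma above. It assumes alpha > 0 and no deflation (all v(i) nonzero, all
lambda_i distinct), so there is exactly one root in each interval
(lambda_i, lambda_(i+1)) and one root above the largest lambda_i.

   % Sketch: eigenvalues of diag(lam) + alpha*v*v' via the secular equation.
   % Here lam is sorted increasing, for convenience.
   n = 5; lam = sort(randn(n,1)); v = randn(n,1); alpha = 0.7;
   f = @(x) 1 + alpha*sum(v.^2 ./ (lam - x));               % secular function
   mu = zeros(n,1);
   for i = 1:n-1
      mu(i) = fzero(f, [lam(i)+1e-12, lam(i+1)-1e-12]);     % root in (lam(i), lam(i+1))
   end
   mu(n) = fzero(f, [lam(n)+1e-12, lam(n)+alpha*(v'*v)+1]); % root above lam(n)
   err_eigenvalues = norm(mu - sort(eig(diag(lam) + alpha*(v*v'))))
   q = (diag(lam) - mu(1)*eye(n)) \ v;  q = q/norm(q);      % eigenvector formula
   err_eigenvector = norm((diag(lam) + alpha*(v*v'))*q - mu(1)*q)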
We can also use Eig_update as a building block for an entire eigensolver, using
divide-and-conquer. To do so we need to see how to divide a tridiagonal matrix
into two smaller ones of half the size, using just a rank-one change
alpha*u*u^T. We divide in the middle: if
T = tridiag(a(1),...,a(n); b(1),...,b(n-1)) is a symmetric tridiagonal matrix
with diagonal entries a(i) and off-diagonal entries b(i), then we can write

   T = [ a(1) b(1)                               ]
       [ b(1) a(2)  b(2)                         ]
       [          ...                            ]
       [     b(i-1) a(i)    b(i)                 ]
       [            b(i)    a(i+1) b(i+1)        ]
       [                       ...               ]
       [                    b(n-1)  a(n)         ]

     = [ a(1) b(1)                               ]     [                    ]
       [ b(1) a(2)  b(2)                         ]     [                    ]
       [          ...                            ]     [                    ]
       [     b(i-1) a(i)-b(i)   0                ]  +  [     b(i) b(i)      ]
       [            0    a(i+1)-b(i) b(i+1)      ]     [     b(i) b(i)      ]
       [                       ...               ]     [                    ]
       [                    b(n-1)  a(n)         ]     [                    ]

     = diag(T1,T2) + b(i)*u*u^T

where T1 and T2 are two half-size tridiagonal matrices (if i = n/2) and u is the
vector with u(i) = u(i+1) = 1 and all other u(j) = 0.

Using this notation, we can write our overall algorithm as follows:

   function [Q,Lambda] = DC_eig(T)   ... return T = Q*Lambda*Q^T
      n = dimension(T)
      if n small enough
         use QR iteration
      else
         i = floor(n/2)
         write T = diag(T1,T2) + b(i)*u*u^T   ... just notation, no work
         [Q1,Lambda1] = DC_eig(T1)
         [Q2,Lambda2] = DC_eig(T2)
         ... note that diag(Q1,Q2) and diag(Lambda1,Lambda2) are the
         ... eigendecomposition of diag(T1,T2)
         [Q,Lambda] = Eig_update(diag(Q1,Q2), diag(Lambda1,Lambda2), b(i), u)
      endif
   return

(We pause the recorded lecture here.)

Now we turn to algorithms that are fastest when only a few eigenvalues and
eigenvectors are desired. Afterward we return to briefly describe MRRR, the
fastest algorithm when all eigenvalues and eigenvectors are desired.

Recall Sylvester's Theorem: Suppose A is symmetric and X nonsingular. Then A and
X*A*X^T have the same Inertia = (#evals<0, #evals=0, #evals>0). So A - sigma*I
and X*(A - sigma*I)*X^T have the same Inertia, namely
   (#evals of A < sigma, #evals of A = sigma, #evals of A > sigma)
So if X*(A - sigma*I)*X^T were diagonal, it would be easy to count the number of
eigenvalues of A less than, equal to, or greater than sigma, for any sigma. By
doing this for sigma_1 and sigma_2, we can count the number of eigenvalues in
any interval [sigma_1, sigma_2], and by repeatedly cutting intervals in half, we
can compute intervals containing the eigenvalues that are as narrow as we like,
or that only contain eigenvalues in regions of interest (e.g. the smallest).

But how do we cheaply choose X to make X*(A - sigma*I)*X^T diagonal? By starting
with A = T tridiagonal, and doing Gaussian elimination without pivoting:
T - sigma*I = L*D*L^T, so inv(L)*(T - sigma*I)*inv(L)^T = D, and D and
T - sigma*I have the same inertia. However, LU without pivoting seems
numerically dangerous. Fortunately, because of the tridiagonal structure, it is
not, if done correctly:

   function c = Negcount(T,s)
      ... count the number c of eigenvalues of T < s
      ... assume T has a(1),...,a(n) on its diagonal and b(1),...,b(n-1) off-diagonal
      ... only compute the diagonal entries d(i) of D in T - s*I = L*D*L^T
      c = 0
      b(0) = 0
      d(0) = 1
      for i = 1 to n
         d(i) = (a(i) - s) - b(i-1)^2/d(i-1)   ... need to obey the parentheses!
         if d(i) < 0, c = c+1, end
      end

We don't need l(i) = the i-th off-diagonal entry of L, because
   (T - s*I)(i,i+1) = b(i) = (L*D*L^T)(i,i+1) = d(i)*l(i)
so we can replace the usual inner loop of LDL^T (analogous to LU),
   d(i) = a(i)-s - l(i-1)^2*d(i-1)
by
   d(i) = a(i)-s - b(i-1)^2/d(i-1)

Thm: Assuming we don't divide by zero, overflow or underflow, the computed c
from Negcount(T,s) is exactly correct for a slightly perturbed T + E, where E is
symmetric and tridiagonal, E(i,i) = 0 (the diagonal is not perturbed at all),
and |E(i,i+1)| <= 2.5*macheps*|T(i,i+1)|.

Proof: As before, we replace each operation like a+b in the algorithm with
(a+b)*(1+delta) where |delta| <= macheps. If we obey the parentheses in the
algorithm, so that a(i) - s is subtracted first, then we can divide out by all
the (1+delta) factors multiplying a(i)-s to get
   d(i)*F(i) = a(i)-s - b(i-1)^2*G(i)/d(i-1)
where F(i) and G(i) are each the product of 2 or 3 (1+delta) factors.
Now write this as
   d(i)*F(i) = a(i)-s - b(i-1)^2*G(i)*F(i-1)/(d(i-1)*F(i-1))
or
   d'(i) = a(i) - s - b'(i-1)^2 / d'(i-1)
where d'(i) = d(i)*F(i) and b'(i-1) = b(i-1)*sqrt(G(i)*F(i-1)), so that
|b'(i-1) - b(i-1)| <= 2.5*macheps*|b(i-1)| to first order; this is the exact
recurrence for T+E.
Since d(i) and d'(i) = d(i)*F(i) have the same sign, we get the same exact
Negcount for either.

In fact more is true: this works even if some pivot d(i-1) is exactly zero, and
we divide by zero, so the next d(i) is infinite, because we end up dividing by
d(i) in the next step, and the "infinity" disappears. But it does mean that to
correctly compute Negcount, we need to count -0 as negative, and +0 as positive
(-0 is a feature of IEEE floating point arithmetic). Furthermore, the function
g(sigma) = #evals < sigma is computed monotonically increasing, as long as the
arithmetic is correctly rounded (otherwise the computed number of eigenvalues in
a narrow enough interval might be negative!).

The cost of Negcount is clearly O(n). To find one eigenvalue with Negcount via
bisection, Negcount needs to be called at most 64 (in double) or 32 (in single)
times, since each time the interval gets half as big, and essentially determines
one more bit in the floating point format. So the cost is still O(n). So to find
k eigenvalues using bisection, the cost is O(k*n).

(We pause the recorded lecture here.)

Given an accurate eigenvalue lambda_j from bisection, we find its eigenvector by
inverse iteration:
   choose x(0), i = 0
   repeat
      i = i+1
      solve (T - lambda_j*I)*y = x(i-1) for y
      x(i) = y / ||y||_2
   until x(i) converges
which should converge in a few steps, and so at a cost of O(n) per eigenvector
(a Matlab sketch combining bisection and inverse iteration appears below, after
the discussion of MRRR).

This seems like an optimal algorithm for all n eigenvalues and eigenvectors: at
a cost of O(n) per eigenpair, the total cost should be O(n^2). But it has a
problem: the computed eigenvectors, while they individually very nearly satisfy
T*x = lambda(j)*x as desired, may not be orthogonal when lambda(j) is very close
to another lambda(j+1); there is nothing in the algorithm to enforce this, and
when lambda(j) is close to lambda(j+1), solving with T - lambda(j)*I is nearly
the same as solving with T - lambda(j+1)*I. Imagine the extreme case where
lambda(j) and lambda(j+1) are so close that they round to the same floating
point number! One could start with a random starting vector x(0) and hope for
the best, but there is no guarantee of orthogonality.

At first people tried to guarantee orthogonality by taking all the computed
eigenvectors belonging to a cluster of nearby eigenvalues and running QR
decomposition on them, replacing them by the columns of Q. This guarantees
orthogonality, but has two problems:
   (1) if the computed vectors are not linearly independent, the columns of Q
       may not satisfy T*q = lambda*q very well;
   (2) if the size of the cluster of close eigenvalues is s, the cost of the QR
       decomposition is O(s^2*n), so if s is large, the cost could go back up
       to O(n^3).

The MRRR = Multiple Relatively Robust Representations algorithm was motivated by
this problem: computing all the eigenvectors in O(n^2) time such that they are
also guaranteed to be orthogonal. It is the fastest algorithm available (for
large enough problems; it defaults to other algorithms for small problems), but
is rather complicated to explain. Two lectures explaining it in detail are
available from the 2004 Ma221 web page; see the class web site for details.
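Here is the sketch promised above (a sketch only, not the LAPACK routines
sstebz/sstein): bisection via Negcount to find the k-th smallest eigenvalue of
T, followed by a few steps of inverse iteration. T is given by its diagonal a
(a column vector of length n) and off-diagonal b (a column vector of length
n-1); the exact-zero-pivot and -0 subtleties discussed above are not handled.

   % Sketch: save as bisect_invit.m
   function [lam, x] = bisect_invit(a, b, k)      % k-th smallest eigenpair of T
      n = length(a);
      lo = min(a - [abs(b);0] - [0;abs(b)]);      % Gershgorin bounds bracket the spectrum
      hi = max(a + [abs(b);0] + [0;abs(b)]);
      while hi - lo > 4*eps*max(1, max(abs(lo), abs(hi)))
         mid = (lo + hi)/2;
         if negcount(a, b, mid) >= k              % at least k eigenvalues are < mid
            hi = mid;
         else
            lo = mid;
         end
      end
      lam = (lo + hi)/2;
      T = spdiags([[b;0] a [0;b]], -1:1, n, n);   % sparse storage, so each solve is O(n)
      x = randn(n,1); x = x/norm(x);
      for it = 1:3                                % a few steps of inverse iteration
         y = (T - lam*speye(n)) \ x;              % nearly singular: a warning is expected
         x = y/norm(y);
      end
   end
   function c = negcount(a, b, s)                 % # eigenvalues of T < s, as in the notes
      c = 0; d = 1; bb = [0; b];
      for i = 1:length(a)
         d = (a(i) - s) - bb(i)^2/d;              % obey the parentheses!
         if d < 0, c = c + 1; end
      end
   end

For example, [lam, x] = bisect_invit(a, b, 3) returns the third smallest
eigenvalue of T and an approximate eigenvector, at O(n) cost per eigenpair, but
with no orthogonality guarantee for clustered eigenvalues, as discussed above.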
The final algorithm of importance is Jacobi's method, whose classical (and slow)
implementation is described in section 5.3.5. Jacobi can be shown to be
potentially the most accurate algorithm, with an error determined by a version
of Weyl's Theorem for relative error (Thm 5.6 in the text), and so it can
compute the tiny eigenvalues (and their eigenvectors) much more accurately.
It was much slower than the other methods discussed so far, until work by Drmac
and Veselic showed how it could be made much faster.

All the algorithms discussed so far extend to the SVD, typically by using the
relationship between the SVD of A and the eigenvalues and eigenvectors of the
symmetric matrix [ 0 A ; A^T 0 ]. The one exception is MRRR, where some open
problems remain, addressed in the 2010 PhD dissertation of Paul Willems,
although there are still examples where Willems' code loses significant accuracy.
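As a final numerical sanity check of that relationship (a sketch, not from the
notes): the eigenvalues of [ 0 A ; A^T 0 ] are plus and minus the singular
values of A, plus |m-n| zeros when A is m-by-n and rectangular.

   % Sketch: compare eig([0 A; A' 0]) with +/- svd(A) for a rectangular A (m > n).
   m = 5; n = 3; A = randn(m,n);
   B = [zeros(m) A; A' zeros(n)];
   [ sort(eig(B)), [-sort(svd(A),'descend'); zeros(m-n,1); sort(svd(A))] ]
   % the two columns should agree to roughly macheps*norm(A)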