Math 128a - Spring 2002 - Lecture 7 - Feb 12 (Tuesday)

Read: sections 6.0, 6.1, 6.2

Goals: Finish contraction mapping theorem from last time
       Solving f(x)=0 with more than one equation and unknown
       Maximizing and minimizing
       What to do when you don't have f'(x)
       Where to find state-of-the-art software

Now we turn to solving a system of nonlinear equations in more than 1 unknown. In particular, let x = (x1,x2,...,xn) be a vector of n unknowns, and let f(x) = (f1(x),f2(x),...,fm(x)) be a vector of m functions. We would like to find a solution of f(x)=0, if one exists.

The first question to explore is what to expect depending on the values of m and n. So far we have chosen m=n=1. What happens if m and n are different? What can the solutions look like? To illustrate, let's look at 1=m < n=2, fewer equations than unknowns.

Example 1: f(x) = f1(x1,x2) = x1 + x2 = 0
In this case the solution set is an entire straight line, not a single point.

Example 2: f(x) = x1^2 + x2^2 - 1 = 0
In this case the solution set is a circle, again not a single point.

Such systems are called "underdetermined" because of the likelihood of having an infinite number of solutions, lying on a curve.

Now let's try 2=m > n=1, more equations than unknowns.

Example 1: f1(x1) = x1 and f2(x1) = x1+1
There is clearly no simultaneous solution of x1=0 and x1+1=0. Such systems are called "overdetermined", because of the likelihood of having no solutions.

Finally, let's try 2=m = n.

Example 1: f1(x1,x2) = x1 + x2 - 1, f2(x1,x2) = -x1 + x2 - 3.
Thinking about this geometrically, the equations f1(x1,x2)=0 and f2(x1,x2)=0 each determine a line in the plane, and their intersection point is the solution we seek. Thinking algebraically, this is a system of linear equations, which we typically write in the form

   [ 1  1 ] * [ x1 ] = [ 1 ]
   [-1  1 ]   [ x2 ]   [ 3 ]

or A * x = b, where A is an n-by-n coefficient matrix, x is a column vector of unknowns, and b is a column vector of data.

We will talk about a systematic algorithm for this (called Gaussian elimination) later, but for now just add the first equation to the second to get the equivalent system

   x1 + x2 = 1
        2*x2 = 4

or

   [ 1  1 ] * [ x1 ] = [ 1 ]
   [ 0  2 ]   [ x2 ]   [ 4 ]

Now solve the second equation for x2 = 4/2 = 2, and substitute x2=2 into the first equation to get x1 = -1. Thus there is a unique solution (x1,x2) = (-1,2). (A quick Matlab check of this appears at the end of this discussion.)

Example 2: Consider

   f1(x1,x2) = x1^2 + x2^2 - 1 = 0
   f2(x1,x2) = (x1-1)^2 + x2^2 - 2 = 0

Geometrically, each equation f1=0 and f2=0 determines a circle, and the intersection of these two circles (2 points) is the answer we seek. So we see that the solution can consist of a finite number of points, but more than 1. Algebraically, there is a technique like Gaussian elimination for systems of polynomial equations, called "computing Groebner bases", but it is much more expensive and sensitive to roundoff, and we will not talk about it in this class.

Merely counting the number of unknowns n and equations m does not always tell us whether there are an infinite number of solutions lying on a curve (m<n), no solutions (m>n), or a finite number of solutions (m=n), even though these situations are "typical" in a certain sense. For example, the single equation f1(x1,x2) = |x1| + |x2| + 1 = 0 obviously has no solutions, and the two equations f1(x1) = 0 and f2(x1) = 0, with f1 and f2 both identically zero, clearly have an infinite number of solutions: every value of x1! But here we are interested in the typical case, and in particular in the case when there are a finite number of solutions to find, so we will concentrate on m=n.
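As a quick sanity check of Example 1 above, one could type the following few lines in Matlab (the variable names are arbitrary); the backslash operator used here is the same one that will reappear in Newton's method below.

   % quick check of Example 1:  A*x = b
   A = [ 1 1 ; -1 1 ];   % coefficient matrix
   b = [ 1 ; 3 ];        % right-hand side
   x = A \ b             % backslash solves the linear system; should print x = [-1 ; 2]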
So now we consider solving a system of n nonlinear equations in n unknowns:

   f1(x1,x2,...,xn) = 0
   f2(x1,x2,...,xn) = 0
   ...
   fn(x1,x2,...,xn) = 0

for a solution (x1,x2,...,xn). We also write this as f(x) = 0.

We will again use Newton's method, with the same idea as before:
  1) given an approximate zero x(i) = [x1(i),x2(i),...,xn(i)], approximate f(x) near x(i) by a simple (linear) function f*(x).
  2) find a zero of f*(x), call it x(i+1), and repeat.

How to implement 1): look at each component of f, and use a Taylor expansion:

   fj(x) ~ fj(x(i)) + sum from k=1 to n [ d fj(x(i))/d xk * (xk - xk(i)) ]

Setting each of these approximations to zero yields a system of n linear equations in the n unknowns (x1 - x1(i)), ..., (xn - xn(i)):

   d f1(x(i))/d x1 * (x1 - x1(i)) + ... + d f1(x(i))/d xn * (xn - xn(i)) = -f1(x(i))
   d f2(x(i))/d x1 * (x1 - x1(i)) + ... + d f2(x(i))/d xn * (xn - xn(i)) = -f2(x(i))
   ...
   d fn(x(i))/d x1 * (x1 - x1(i)) + ... + d fn(x(i))/d xn * (xn - xn(i)) = -fn(x(i))

In matrix form we write this as

       [ x1 - x1(i) ]   [ -f1(x(i)) ]
   J * [ x2 - x2(i) ] = [ -f2(x(i)) ]
       [     ...    ]   [    ...    ]
       [ xn - xn(i) ]   [ -fn(x(i)) ]

where the matrix J, called the Jacobian of f at x(i), has entries J(k,j) = d fk(x(i)) / d xj.

This linear system can be solved by Gaussian elimination (later in the course). In Matlab, you build the n-by-n matrix J and the vector f(x(i)) = [ f1(x(i)) , f2(x(i)) , ... , fn(x(i)) ]'. Then the new vector xnew = x(i+1) is computed from the old one xold = x(i) by

   xnew = xold - J\f(x(i))

Here J\f means "solve the linear system with coefficient matrix J and right-hand side f". If there is only one variable, this is identical to Newton's method from before. This multivariate Newton method converges quadratically, as does univariate Newton's method, with essentially the same analysis.

Example: f1(x,y) = x^2 + a*y^2 - 1 = 0,  f2(x,y) = (x-1)^2 + y^2 - 1 = 0
What are these two curves whose intersection we want? (For a=4, an ellipse and a circle.)

   run newt2 with x=1.1, y=.9, a=4
   run newt2 with x=1,   y=-1, a=4
   run newt2 with x=2,   y=2,  a=4

(A minimal sketch of the iteration newt2 performs is given just below.)
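The code of newt2 itself is not reproduced in these notes, but a minimal sketch of a multivariate Newton iteration for this example might look like the following (the variable names and the fixed iteration count are arbitrary choices, not the actual newt2 script):

   % Newton's method for f1(x,y) = x^2 + a*y^2 - 1,  f2(x,y) = (x-1)^2 + y^2 - 1
   a = 4;                 % parameter in f1
   v = [ 1.1 ; 0.9 ];     % starting guess (x ; y)
   for i = 1:10           % a real code would stop when the step is small enough
     x = v(1);  y = v(2);
     f = [ x^2 + a*y^2 - 1 ; (x-1)^2 + y^2 - 1 ];   % f at the current point
     % Jacobian: row k holds the partial derivatives of fk
     J = [ 2*x, 2*a*y ; 2*(x-1), 2*y ];
     v = v - J \ f;       % Newton step: solve J*(vnew - v) = -f
   end
   v                      % should now be close to one of the two intersection points

The same iteration, with f replaced by grad g and J by the Hessian of g, is what the optimization discussion below calls for.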
End of Newton for solving f(x)=0

Now that we know how to find zeros of functions of one or more variables, what else can we do? We can try to minimize or maximize functions. This problem is called "optimization". If f(x) is a smooth function of a single real variable, its critical points (where f'(x)=0) include its maxima and minima (picture). So solving f'(x)=0 is a way to find minima and maxima. If f(x) is a scalar function of n variables, you want to find the zeros of the gradient grad f, which is an n-dimensional function of n variables, so we can apply Newton. Now the Jacobian matrix contains the derivatives of grad f, i.e. the second derivatives of f; this matrix is called the Hessian.

All this works well, for zero finding or optimization, presuming that you are able to compute the derivatives of the function whose zeros you want. But now we have the same problem as with a single equation in a single unknown: how do you compute (or estimate) all the derivatives you need? In practice, f(x) is unlikely to be a simple function that you can differentiate by hand. Instead, it is likely to be something you can evaluate only by running a very large program.

Example: Suppose you have a mechanical (or electrical) device to design to perform a certain task, and you'd like to design it in the best possible way. You choose a set of parameters, which define the shape of your device (in the mechanical case), and you have a program that simulates how the device works in practice. For example, imagine you work at Boeing and want to optimize the shape of a new wing design.

So you choose some parameters, probably hundreds, that define the shape of a wing (they are the parameters in some Computer Aided Manufacturing (CAM) tool). You then put this wing shape into an existing program that computes how the air flows over the wing and, among other things, measures the drag on the airplane. You want to minimize drag, because this will save on fuel costs, something your customer, the airline industry, cares a lot about (Boeing and Airbus compete on this point).

So, mathematically speaking, you have a function drag = g(x), where drag is a scalar and x is a vector of hundreds of parameters defining the wing shape. g(x) is evaluated by running a large, complicated program consisting of dozens of subroutines, written and modified over a period of many years by many people. No one may really understand how it works anymore, or have time to figure it out. You want to minimize drag = g(x) as a function of x. You assume that g is differentiable, and so expect the minimum to occur somewhere where the gradient of g is zero:

   grad g = ( dg/dx(1), dg/dx(2), ... , dg/dx(n) ) = (0, 0, ... , 0)

This is because if the gradient is nonzero, you can decrease g by moving x in a direction opposite to that of the gradient. This means we want to solve f(x) = 0, where f = grad g is a vector of n functions (dg/dx(i), i=1,...,n) of n unknowns (x(j), j=1,...,n). In this example, just evaluating our function f(x) requires derivatives; solving f(x)=0 by Newton's method may require yet more. So what techniques are there for evaluating derivatives of a function that we cannot easily differentiate by hand?

1) We can estimate derivatives by differences, as in the secant method. To see how much this costs, suppose we have a single function of many variables f(x1,x2,...,xn), and we want to estimate the gradient of f. Mimicking the approximation we used in the secant method, we write

   df/dxi ~ [ f(x1,...,xi+d,...,xn) - f(x1,...,xi,...,xn) ] / d

Thus, estimating the gradient costs n extra calls to the program computing f, one per component. This is quite different from the secant method, which reused function values we had already computed and so required no extra function evaluations. When f has n components, so that we need the whole Jacobian, this is n^2 extra function evaluations. When n is just 10, this is already 100 more function evaluations per step, which is a lot. (A small Matlab sketch of this difference approximation appears at the end of these notes.)

2) There are software systems that take a program that computes a function f(x) and automatically produce another program that computes df/dx(x). They do this "line by line", using techniques like the following. Suppose there is a line of code

   z = x + y

where x and y have been computed previously. Suppose we have modified the program up to this line, so that new variables dx and dy have been computed, which are the derivatives of x and y. Then we can add another line of code to compute dz, the derivative of z, as follows:

   dz = dx + dy

Similarly, if z = x * y we add the code

   dz = x * dy + dx * y

and for z = x / y we add

   dz = ( dx * y - x * dy ) / y^2

This can in fact be done in some languages by declaring variables like z to be pairs [z,dz], and redefining an operation like multiplication to update both z and dz. Thus z = x * y really computes

   [ z  ]   [ x * y           ]
   [ dz ] = [ x * dy + dx * y ]

Thus, just by changing the declarations and not the code, we can compute derivatives. (A tiny illustration of this pair idea also appears at the end of these notes.) You may well ask how to deal with branches, loops, subroutine calls, etc. There is software for dealing with all these possibilities, using compiler techniques to take a program for f(x) and return one for df/dx(x). See the following URL for details.
http://www.mcs.anl.gov/autodiff/index.html

The software here handles both Fortran and C routines. It has been used to differentiate a program that computes the air flow over a wing, to help optimize the wing shape.

3) To lower the cost even more, we can update our estimates of the Jacobian less frequently. On your homework you will consider the scalar iteration

   x(i+1) = x(i) - f(x(i))/alpha

and show that alpha does not have to be too close to f'(x(i)) to get convergence. The same is true for a system of equations, and there are a variety of methods that estimate the Jacobian cheaply and still get convergence ("Broyden's method"). No details here, but you should know they are available.

Finally, a word on finding good software for zero finding or optimization. This is such an important area that a lot of good algorithms and publicly available software exist, developed mostly at universities and Federal government labs (your tax dollars at work). See

   www-fp.mcs.anl.gov/otc/Guide/  - pointers to available software and documentation
   www-neos.mcs.anl.gov           - on-line optimization server
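Here is the small Matlab sketch of the difference approximation from technique 1 above. The function name diffgrad and the increment d are illustrative choices, not standard names; g is assumed to be a handle to a scalar function of a column vector.

   % Estimate grad g at the point x by one-sided differences (technique 1 above).
   function gradg = diffgrad(g, x, d)
     n  = length(x);
     g0 = g(x);                        % one evaluation at the base point
     gradg = zeros(n,1);
     for i = 1:n
       xs = x;
       xs(i) = xs(i) + d;              % perturb the i-th coordinate only
       gradg(i) = ( g(xs) - g0 ) / d;  % n extra evaluations of g in all
     end
   end

For example, diffgrad(@(x) x(1)^2 + x(2)^2, [1;2], 1e-6) should return approximately [2;4].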
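And here is a tiny illustration of the pair idea from technique 2 above. A real automatic differentiation tool redefines the arithmetic operators themselves; here hypothetical helpers (ad_add, ad_mul, ad_div are my names, not part of any tool) act on pairs [value ; derivative]:

   % Each variable is carried as a pair p = [ value ; derivative ].
   ad_add = @(x,y) [ x(1) + y(1) ; x(2) + y(2) ];                        % z = x + y
   ad_mul = @(x,y) [ x(1) * y(1) ; x(1)*y(2) + x(2)*y(1) ];              % z = x * y
   ad_div = @(x,y) [ x(1) / y(1) ; ( x(2)*y(1) - x(1)*y(2) ) / y(1)^2 ]; % z = x / y

   % Example: differentiate h(t) = (t*t + 3) / t at t = 2.
   t = [ 2 ; 1 ];     % the independent variable: value 2, derivative dt/dt = 1
   c = [ 3 ; 0 ];     % a constant has derivative 0
   h = ad_div( ad_add( ad_mul(t,t), c ), t );
   % h(1) is the value (3.5) and h(2) the derivative (0.25 = 1 - 3/2^2)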