CS 4: Lecture 22
Wednesday, April 12, 2006

OPTIMIZATION
============

Let f(T, p) be a function that maps the temperature T of your car's engine
and the percentage p of air in the air/gasoline mixture to the amount of
carbon dioxide your car emits per mile of driving.  Being a good citizen,
you want to tune your engine to minimize your CO2 emission.  This is called
an _optimization_problem_.

Science and engineering have literally millions of optimization problems.

- Find the antenna length that maximizes signal reception.
- Find the driving speed that maximizes fuel efficiency.
- Find the doping (impurity) levels in silicon that maximize transistor
  speed.
- Choose the shape of an airplane wing that maximizes lift.
- Find the diet that meets a student's nutritional requirements for the
  fewest dollars.

Formally, many optimization problems look like this.

  Find the values x_1, x_2, ..., x_d that minimize f(x_1, x_2, ..., x_d).

Often you want to maximize a quantity rather than minimize it, but
maximizing f is the same as minimizing -f, so most optimization problems can
be cast as minimization problems.

The variables x_1, x_2, ..., x_d are the _independent_variables_ or
_parameters_ of the optimization problem.  To simplify notation, we'll wrap
them up into one vector x = (x_1, x_2, ..., x_d).  This seems natural when x
is the position of a radio tower in 3D space that optimizes the cell phone
reception in Berkeley.  It seems less natural when x_1 is temperature, x_2
is the air/gasoline mixture, and x_3 is engine RPMs.  Nevertheless, even
when the variables are in entirely different units, we're still going to
wrap them up into one vector and treat it as a point in a d-dimensional
_parameter_space_.  That's why the number of parameters is denoted by d--it
stands for "dimension."

The dependent variable f(x) is called the _objective_function_.  Memorize
this term.  Virtually every optimization problem revolves around optimizing
some objective function.

Local vs.
Global Minima
-------------

Ideally, we would like to find the _global_minimum_ of f--that is, the value
x such that f(x) <= f(y) for _every_ vector y.  Unfortunately, objective
functions often have local minima.  A _local_minimum_ of f is a value x such
that f(x) <= f(x + y) for every y such that |y| < r, for some positive r.
In other words, there is no better solution than x within some radius r
of x.

Consider the function f(a) = 3 a^4 - 4 a^3 - 12 a^2.

  [ASCII plot of f(a):  a quartic curve with a shallow local minimum at
   a = -1 and a much deeper global minimum at a = 2.]

This function has two local minima, at a = -1 and a = 2.  However, only
a = 2 is a global minimum.  The other local minimum isn't nearly as good.

As a general rule, finding local minima is easy.  Finding global minima is
sometimes easy, and sometimes very, very hard.  Many practical problems
require the minimum of a quadratic function of several variables.  A
quadratic function has at most one local minimum, which therefore must be
the global minimum!

The reason it's sometimes very hard to find a global minimum is that it
might be very far from the local minima you find.  If your goal is to
maximize your altitude, you might feel pretty good on top of Mount McKinley,
but Mount Everest is on the other side of the planet.

Consider the function f(theta) = sin theta.

  [ASCII plot of sin theta:  a sinusoid oscillating between -1 and 1,
   crossing the axis at the origin.]

This function has infinitely many local minima, all of which are global
minima too.  But suppose we add another sinusoidal curve with a different
frequency, like f(theta) = sin theta + 0.1 cos (theta / pi).
  [ASCII plot of sin theta + 0.1 cos (theta / pi):  a sinusoid whose local
   minima drift lower and lower, approaching -1.1 but never reaching it.]

This curve has infinitely many local minima, but none of them is a global
minimum!  No matter which local minimum you pick, you can always find one
that's even lower.  (The local minima get arbitrarily close to -1.1, but
none reaches it.)

In an optimization problem, what kind of minimum do we need?  For a few
problems, any local minimum will do.  For a few problems, the global minimum
is absolutely necessary.  But for most problems, we just need a very good
local minimum--preferably one that's almost as good as the global minimum.
Your car might not be getting the absolute minimum carbon dioxide emissions
possible, but if you're only off by a few percent, you won't get all
stressed about it.

An important consideration in choosing an optimization algorithm is, how
long does a function evaluation take?  Sometimes they're fast, but in some
applications, function evaluations take a long time.  In an extreme case, a
function evaluation might involve performing a scientific experiment, like
actually measuring the carbon dioxide output of a car under certain
conditions.

Random Search
-------------

Our first optimization algorithm is very simple.

- Pick an integer n signifying how long you're willing to wait for an
  answer.
- Repeat n times:
  o Choose a random vector x in the domain.
  o If f(x) is the smallest value of f so far, remember x.
- Output the best x you found.

Random search is a lousy algorithm, but it's better than nothing.  Its main
fault is that it doesn't exploit the fact that a good value of x probably
has an even better value of x nearby.  When you find yourself halfway up
Mount Everest, you should look around for the peak.  Instead, the random
search algorithm immediately jumps to another random location on Earth.
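The random search procedure above can be sketched in a few lines of Python.
(This is an illustration, not part of the original notes; the objective
function here is the quartic from the earlier example, and the domain bounds
and sample count are made up.)

```python
import random

def random_search(f, lo, hi, n):
    """Minimize f over the box lo[i] <= x_i <= hi[i] by sampling n
    random points.  f takes a list of d coordinates; lo and hi are
    lists of length d.  Returns the best point found and its value."""
    best_x, best_val = None, float("inf")
    for _ in range(n):
        # Choose a random vector x in the domain.
        x = [random.uniform(lo[i], hi[i]) for i in range(len(lo))]
        val = f(x)
        if val < best_val:          # smallest value of f so far: remember x
            best_x, best_val = x, val
    return best_x, best_val

# The quartic from earlier: local minimum at a = -1, global minimum at a = 2.
random.seed(0)                      # for reproducibility
f = lambda p: 3*p[0]**4 - 4*p[0]**3 - 12*p[0]**2
x, val = random_search(f, [-3.0], [3.0], 10000)
```

With 10,000 samples of a one-dimensional domain, the best sample almost
certainly lands near the global minimum at a = 2, where f(2) = -32.  In
higher dimensions, the same budget covers the domain far more thinly.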
A lesser fault is that random search may neglect some parts of the search
space, because it just randomly neglects to check them.

Grid Search
-----------

The grid search algorithm addresses the lesser fault of the random search
algorithm.  The algorithm evaluates f at the vertices of a rectangular grid,
and chooses the vertex with the least value of f.  For example, in the
domain below, we evaluate f at the points marked `o'.

  +-------------------+
  |                   |
  | o o o o o o o o o |
  |                   |
  | o o o o o o o o o |
  |                   |
  | o o o o o o o o o |
  |                   |
  | o o o o o o o o o |
  |                   |
  +-------------------+
         domain

Grid search seems clever in a two-dimensional domain, but its weakness is
"the curse of dimensionality":  if you use, say, a thousand grid points
along each axis, then you need 1,000^d grid points overall, where d is the
dimension of the parameter space.  That gets big if d is more than two!  If
you have six parameters and can afford a million function evaluations, that
only gives you ten grid points in each direction.  That's not usually enough
to find a good minimum.  (Random search suffers from the curse of
dimensionality just as badly, though it's not as obvious as with a grid.)

As a general rule, grid search is most appropriate when function evaluations
are cheap or the number of parameters is small.  But it's only a good start;
you should refine the result using an algorithm like grid walk or
Nelder-Mead.

Grid Walk
---------

The grid walk algorithm is like the grid search algorithm, except that you
don't search the whole grid.  Instead, you choose a starting point and "walk
downhill" from there, evaluating f only at nearby locations.

For grid walk, you need to choose a starting point x (perhaps the center of
the domain) and a step size s (perhaps 10% of the domain width).  Then you
evaluate the objective function f at the following locations.

  f(x_1, x_2, ..., x_d)
  f(x_1 + s, x_2, ..., x_d)
  f(x_1 - s, x_2, ..., x_d)
  f(x_1, x_2 + s, ..., x_d)
  f(x_1, x_2 - s, ..., x_d)
  ...
  f(x_1, x_2, ..., x_d + s)
  f(x_1, x_2, ..., x_d - s)

In other words, you consider taking a step along each of the parameter axes,
in either the positive or negative direction.  Then you walk to the best
point (i.e. set x to be the point with the best objective function value)
and repeat.

Observe that grid walk breaks the curse of dimensionality.  Instead of
testing a number of points exponential in d, now you just check 2d points
other than x.  Thus grid walk can be a lot faster than the previous two
algorithms.

What if x is better than those 2d points?  You reduce the step size, perhaps
by cutting s in half, and try again.  If x is still best, reduce the step
size again.  This gives grid walk another big advantage over the previous
algorithms:  when it finds a good point, it tries to _refine_ the result by
searching locally, in successively smaller steps, until it homes in on a
local minimum.  In the following example, grid walk takes a step to the
right, then reduces the step size, then takes a step down, then terminates.

  [Diagram:  four snapshots of a 2D domain showing grid walk's current
   point x and its four neighbors o -- a step to the right, a step-size
   reduction, a step down, then termination at a local minimum.]

An important decision you need to make when you implement grid walk is when
to stop.  The usual rule is to stop when the step size is so small that you
don't expect to make much further progress.  In other words, choose an
s* > 0 and stop when s < s*.

Note that you don't have to use the same step size for each parameter.  In
fact, if the parameters are measured in different units--say, temperature,
percentage air/gas mixture, and engine RPMs--then you need a different step
size for each parameter.  When you get stuck, you cut all the step sizes in
half.

Grid walk represents a huge advance in speed over the previous algorithms,
especially in high-dimensional parameter spaces.
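Grid walk might be implemented like this in Python (again an illustration,
not from the original notes; the quadratic objective, starting point, and
step sizes are made up).  Following the advice above, it keeps a separate
step size and stopping threshold s* for each parameter, and halves all the
step sizes whenever x beats all 2d of its neighbors:

```python
def grid_walk(f, x, steps, stop):
    """Walk downhill on a grid from starting point x.

    f: objective taking a list of d coordinates.
    x: starting point (list of length d).
    steps: one step size per parameter (they may be in different units).
    stop: one stopping threshold s* per parameter; quit once every step
          size has shrunk below its threshold."""
    x, steps = list(x), list(steps)
    best = f(x)
    while any(s >= s_star for s, s_star in zip(steps, stop)):
        # Evaluate f at the 2d neighbors of x along the parameter axes.
        candidates = []
        for i in range(len(x)):
            for sign in (+1, -1):
                y = list(x)
                y[i] += sign * steps[i]
                candidates.append((f(y), y))
        val, y = min(candidates)
        if val < best:
            x, best = y, val                     # walk to the best point
        else:
            steps = [s / 2.0 for s in steps]     # stuck: halve the steps
    return x, best

# Example: minimize (x-1)^2 + 10(y+2)^2, whose minimum is at (1, -2).
f = lambda p: (p[0] - 1)**2 + 10 * (p[1] + 2)**2
x, val = grid_walk(f, [0.0, 0.0], [0.5, 0.5], [1e-6, 1e-6])
```

Because this objective is quadratic, it has a single local minimum, so grid
walk homes in on the global minimum at (1, -2) no matter where it starts.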
But it has several faults of its own.

- It's happy to stop at any local minimum, even a "bad" one that's not close
  to the global minimum.
- It can prematurely shrink the step size, then take many, many, many steps
  to walk to a local minimum.  You'll also have this problem if you choose
  too small a step size from the start.

The first problem can be lessened by combining two algorithms--for example,
by running grid walk from many random starting points and keeping the best
result.  The second problem can be lessened by a more complicated
optimization algorithm called the Nelder-Mead simplex algorithm.
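One such combination can be sketched as follows (an illustration under the
assumption that the two algorithms being combined are random search and grid
walk): let random search scatter starting points across the domain, run a
minimal grid walk from each, and keep the best local minimum found.  On the
quartic from earlier, restarts that land in the wrong basin walk to the poor
local minimum at a = -1, but some restart almost surely finds the global
minimum at a = 2.

```python
import random

def grid_walk(f, x, s, s_stop):
    """Minimal single-step-size grid walk: move to any improving neighbor
    along the axes; halve s when none improves; stop when s < s_stop."""
    x, best = list(x), f(x)
    while s >= s_stop:
        moved = False
        for i in range(len(x)):
            for sign in (+1, -1):
                y = list(x)
                y[i] += sign * s
                v = f(y)
                if v < best:
                    x, best, moved = y, v, True
        if not moved:
            s /= 2.0
    return x, best

def random_restarts(f, lo, hi, tries, s, s_stop):
    """Random search picks the starting points; grid walk refines each."""
    best_x, best_val = None, float("inf")
    for _ in range(tries):
        start = [random.uniform(lo[i], hi[i]) for i in range(len(lo))]
        x, val = grid_walk(f, start, s, s_stop)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# The quartic from earlier: local minimum at a = -1, global minimum at a = 2.
random.seed(1)                      # for reproducibility
f = lambda p: 3*p[0]**4 - 4*p[0]**3 - 12*p[0]**2
x, val = random_restarts(f, [-3.0], [3.0], 20, 0.5, 1e-6)
```

This doesn't guarantee a global minimum, but with enough restarts it usually
finds a very good local minimum, which is all most problems require.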