                              CS61B: Lecture 37
                          Wednesday, April 23, 2014

AMORTIZED ANALYSIS
==================
We've seen several data structures for which I claimed that the average time
for certain operations is always better than the worst-case time: hash tables,
tree-based disjoint sets, and splay trees.

The mathematics that proves these claims is called _amortized_analysis_.
Amortized analysis is a way of proving that even if an operation is
occasionally expensive, its cost is made up for by earlier, cheaper operations.

The Averaging Method
--------------------
Most hash table operations take O(1) time, but sometimes an operation forces a
hash table to resize itself, at great expense. What is the average time to
insert an item into a hash table with resizing?

Assume that the chains never grow longer than O(1), so any operation that
doesn't resize the table takes O(1) time--more precisely, suppose it takes at
most one second.

Let n be the number of items in the hash table, and N the number of buckets.
Suppose it takes one second for the insert operation to insert the new item,
increment n, and then check if n = N. If so, it doubles the size of the table
from N to 2N, taking 2N additional seconds. This resizing scheme ensures that
the load factor n/N is always less than one.

Suppose every newly constructed hash table is empty and has just one bucket--
that is, initially n = 0 and N = 1. After i insert operations, n = i. The
number of buckets N must be a power of two, and we never allow it to be less
than or equal to n; so N is the smallest power of two > n, which is <= 2n.

The total time in seconds for _all_ the table resizing operations is

  2 + 4 + 8 + ... + N/4 + N/2 + N = 2N - 2.

So the cost of i insert operations is at most i + 2N - 2 seconds. Because
N <= 2n = 2i, the i insert operations take <= 5i - 2 seconds. Therefore, the
_average_ running time of an insertion operation is (5i - 2)/i = 5 - 2/i
seconds, which is in O(1) time.

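The bound above can be checked numerically. What follows is a minimal sketch
in Python (a language chosen for this illustration, not one the notes
prescribe). The function name total_insert_seconds is hypothetical. It only
counts seconds under the cost model just described (one second per insert, 2N
extra seconds when the table doubles) and stores no items.

```python
def total_insert_seconds(i):
    """Total seconds for i inserts under the cost model described above."""
    n, buckets = 0, 1              # newly constructed table: n = 0, N = 1
    seconds = 0
    for _ in range(i):
        seconds += 1               # insert the item, increment n, check n = N
        n += 1
        if n == buckets:           # table is full, so double it
            seconds += 2 * buckets # resizing from N to 2N costs 2N seconds
            buckets *= 2
    return seconds


for i in (1, 2, 3, 4, 8, 1000):
    total = total_insert_seconds(i)
    assert total <= 5 * i - 2      # the bound derived by the averaging method
    print(i, total, total / i)     # the average stays below 5 seconds
```

Under these assumptions the bound is tight exactly when i is a power of two,
because the last insert then triggers a resize.
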
We say that the _amortized_running_time_ of insertion is in O(1), even though
the worst-case running time is in Theta(n).

For almost any application, the amortized running time is more important than
the worst-case running time, because the amortized running time determines the
total running time of the application. The main exceptions are some
applications that require fast interaction (like video games), for which one
really slow operation might cause a noticeable glitch in response time.

The Accounting Method
---------------------
Consider hash tables that resize in both directions: not only do they expand
as the number of items increases, but they also shrink as the number of items
decreases. You can't analyze them with the averaging method, because you don't
know what sequence of insert and remove operations an application might
perform.

Let's try a more sophisticated method. In the _accounting_method_, we "charge"
each operation a certain amount of time. Usually we overcharge. When we
charge more time than the operation actually takes, we can save the excess time
in a bank to spend on later operations.

Before we start, let's stop using seconds as our unit of running time. We
don't actually know how many seconds any computation takes, because it varies
from computer to computer. However, everything a computer does can be broken
down into a sequence of constant-time computations. Let a _dollar_ be a unit
of time that's long enough to execute the slowest constant-time computation
that comes up in the algorithm we're analyzing. A dollar is a real unit of
time, but it's different for different computers.

Each hash table operation has
- an _amortized_cost_, which is the number of dollars that we "charge" to do
  that operation, and
- an _actual_cost_, which is the actual number of constant-time computations
  the operation performs.

The amortized cost is usually a fixed function of n (e.g.
$5 for insertion into a hash table, or $2 log n for insertion into a splay
tree), but the actual cost may vary wildly from operation to operation. For
example, insertion into a hash table takes a long, long time when the table is
resized.

When an operation's amortized cost exceeds its actual cost, the extra dollars
are saved in the bank to be spent on later operations. When an operation's
actual cost exceeds its amortized cost, dollars are withdrawn from the bank to
pay for an unusually expensive operation.

If the bank balance goes into surplus, it means that the actual total running
time is even faster than the total amortized costs imply.

THE BANK BALANCE MUST NEVER FALL BELOW ZERO. If it does, you are spending more
total dollars than your budget claims, and you have failed to prove anything
about the amortized running time of the algorithm.

Think of amortized costs as an allowance. If your dad gives you $500 a month
allowance, and you only spend $100 of it each month, you can save up the
difference and eventually buy a car. The car may cost $30,000, but if you
saved that money and don't go into debt, your _average_ spending obviously
wasn't more than $500 a month.

Accounting of Hash Tables
-------------------------
Suppose every operation (insert, find, remove) takes one dollar of actual
running time unless the hash table is resized. We resize the table in two
circumstances.

- An insert operation doubles the table size if n = N AFTER the new item is
  inserted and n is incremented, taking 2N additional dollars of time for
  resizing to 2N buckets. Thus, the load factor is always less than one.

- The remove operation halves the table size if n = N/4 AFTER the item is
  deleted and n is decremented, taking N additional dollars of time for
  resizing to N/2 buckets. Thus, the load factor is always greater than 0.25
  (except when n = 0, i.e. the table is empty).

Either way, a hash table that has _just_ been resized has n = N/2.
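
The two resizing rules can be written down as a small cost model. This is a
minimal sketch in Python, not an implementation from the course. The class
name ResizingTableCost and its methods are hypothetical, it tracks only n and
N (no items are stored), and it assumes an empty one-bucket starting table.

```python
class ResizingTableCost:
    """Cost-only model of the table: tracks n and N, stores no items."""

    def __init__(self):
        self.n = 0            # number of items
        self.buckets = 1      # number of buckets, N

    def insert(self):
        """Return the actual cost, in dollars, of one insert."""
        cost = 1
        self.n += 1
        if self.n == self.buckets:        # n = N after the insert
            cost += 2 * self.buckets      # resizing to 2N buckets costs 2N
            self.buckets *= 2
        return cost

    def remove(self):
        """Return the actual cost, in dollars, of one remove."""
        if self.n == 0:
            raise ValueError("cannot remove from an empty table")
        cost = 1
        self.n -= 1
        if 4 * self.n == self.buckets:    # n = N/4 after the removal
            cost += self.buckets          # resizing to N/2 buckets costs N
            self.buckets //= 2
        return cost

    def find(self):
        """Return the actual cost, in dollars, of one find."""
        return 1


table = ResizingTableCost()
costs = [table.insert() for _ in range(4)]
print(costs)                              # [3, 5, 1, 9]
print(table.n, table.buckets)             # 4 8
```

In this model every resize leaves n = N/2, which is the fact the accounting
argument relies on.
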
A newly constructed hash table has n = 0 items and N = 1 buckets.

By trial and error, I came up with the following amortized costs.

  insert: 5 dollars
  remove: 5 dollars
  find:   1 dollar

Is this accounting valid, or will we go broke? The crucial insight is that at
any time, we can look at a hash table and know a lower bound for how many
dollars are in the bank from the values of n and N. We know that the last
time the hash table was resized, the number of items n was exactly N/2. So if
n != N/2, there have been subsequent insert/remove operations, and these have
put money in the bank.

We charge an amortized $5 for an insert or remove operation. Every insert or
remove operation costs one actual dollar (not counting resizing) and puts the
remaining $4 in the bank to pay for resizing. For every step n takes away from
N/2, we accumulate another $4. So there must be at least 4|n - N/2| dollars
saved (or 4n dollars for a never-resized one-bucket hash table).

IMPORTANT: Note that 4|n - N/2| is a function of the data structure, and does
NOT depend on the history of hash table operations performed. In general, the
accounting method only works if you can tell how much money is in the bank (or,
more commonly, a minimum bound on that bank balance) just by looking at the
current state of the data structure--without knowing how the data structure
reached that state.

An insert operation only resizes the table if the number of items n reaches N.
According to the formula 4|n - N/2|, there are at least 2N dollars in the bank.
Resizing the hash table from N to 2N buckets costs 2N dollars, so we can afford
it. After we resize, the bank balance might be zero again, but it isn't
negative.

A remove operation only resizes the table if the number of items n drops to
N/4. According to the formula 4|n - N/2|, there are at least N dollars in the
bank. Resizing the hash table from N to N/2 buckets costs N dollars, so we can
afford it.

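The argument can also be spot-checked by simulation. Below is a minimal sketch
in Python, assuming the cost model and amortized charges given above. The
function lowest_balance and the random mix of operations are hypothetical,
chosen only for illustration. A passing run is evidence on one sequence of
operations, not a proof.

```python
import random

AMORTIZED = {"insert": 5, "remove": 5, "find": 1}   # charges from the notes


def lowest_balance(num_ops, seed=0):
    """Run random operations and return the lowest bank balance seen."""
    rng = random.Random(seed)
    n, buckets = 0, 1              # newly constructed table
    resized = False                # has the table ever been resized?
    balance = lowest = 0
    for _ in range(num_ops):
        op = rng.choice(["insert", "remove", "find"])
        if op == "remove" and n == 0:
            op = "insert"          # nothing to remove from an empty table
        actual = 1                 # every operation costs at least $1
        if op == "insert":
            n += 1
            if n == buckets:       # grow from N to 2N buckets for 2N dollars
                actual += 2 * buckets
                buckets *= 2
                resized = True
        elif op == "remove":
            n -= 1
            if 4 * n == buckets:   # shrink from N to N/2 buckets for N dollars
                actual += buckets
                buckets //= 2
                resized = True
        balance += AMORTIZED[op] - actual
        # Lower bound read off the current state: 4|n - N/2| = |4n - 2N|,
        # or 4n for a never-resized one-bucket table.
        bound = abs(4 * n - 2 * buckets) if resized else 4 * n
        assert balance >= bound >= 0
        lowest = min(lowest, balance)
    return lowest


print(lowest_balance(10_000))      # 0 means the balance never went negative
```

The inner assertion checks the state-based lower bound after every operation,
so it fails immediately if the bank ever holds less than 4|n - N/2| dollars.
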
The bank balance never drops below zero, so my amortized costs above are valid.
Therefore, the amortized cost of all three operations is in O(1).

Observe that if we alternate between inserting and deleting the same item over
and over, the hash table is never resized, so we save up a lot of money in the
bank. This isn't a problem; it just means the algorithm is faster (spends
fewer dollars) than my amortized costs indicate.

Why Does Amortized Analysis Work?
---------------------------------
Why does this metaphor about putting money in the bank tell us anything about
the actual running time of an algorithm?

Suppose our accountant keeps a ledger with two columns: the total amortized
cost of all operations so far, and the total actual cost of all operations so
far. Our bank balance is the sum of all the amortized costs in the left
column, minus the sum of all the actual costs in the right column. If the bank
balance never drops below zero, the total actual cost is less than or equal to
the total amortized cost.

  Total amortized cost  |  Total actual cost
  ------------------------------------------
                    $5  |  $1
                    $1  |  $1
                    $5  |  $3
                     .  |  .
                     .  |  .
                     .  |  .
                    $5  |  $1
                    $5  |  $2,049
                    $1  |  $1
  ------------------------------------------
               $12,327  >= $10,333

Therefore, the total running time of all the actual operations is never longer
than the total amortized cost of all the operations.

Amortized analysis (as presented here) only tells us an upper bound (big-Oh) on
the actual running time, and not a lower bound (big-Omega). It might happen
that we accumulate a big bank balance and never spend it, and the total actual
running time might be much less than the amortized cost. For example, splay
tree operations take amortized O(log n) time, where n is the number of items in
the tree, but if your only operation is to find the same item n times in a row,
the actual average running time is in O(1).

If you want to see the amortized analysis of splay trees, Goodrich and Tamassia
have it.
If you take CS 170, you'll see an amortized analysis of disjoint sets. I am
saddened to report that both analyses are too complicated to provide much
intuition about their running times. (Especially the inverse Ackermann
function, which is ridiculously nonintuitive, though cool nonetheless.)