CS61B: Lecture 39 Wednesday, April 30, 2014 Today's reading: Goodrich & Tamassia, Sections 14.1.2-14.1.3. GARBAGE COLLECTION ================== Objects take up space in memory. If your program creates lots of objects, throws them away, and creates more, you might eventually run out of memory. To reduce the chance that this will happen, Java has garbage collection. While the Java Virtual Machine (JVM) runs your program, it also spends little bits of time searching for objects that you're no longer using, so it can reclaim their memory for use by other objects. You don't have to know anything about garbage collection to be an effective Java programmer. But garbage collection is interesting because the JVM uses a lot of hidden data structures to manage memory. These data structures are hidden from your Java program--after all, the JVM, just like any other encapsulated module, should hide the details of how it is implemented. Here's a peak at what's going on under the hood. Roots and Reachability ---------------------- Garbage collection's prime directive is to never sweep up an object your program might possibly use or inspect again. These objects are said to be _live_. The opposite of "live" is _garbage_--objects that your program cannot reference again. Java's design makes it possible for the JVM to determine whether an object can ever be used again by your program or not. Garbage collection begins at the roots. A _root_ is any object reference your program can access directly, without going through another object. There are two kinds (that we know about). First, every local variable (including parameters) in every stack frame on the program stack is a root if it is a reference. (Primitive types like ints are not roots; only references are.) Second, every class variable (aka "static" field) that is a reference is a root. An object is live, and should not be garbage collected, if - it is referenced by a root (obviously), or - it is referenced by a field in another live object. Note that this definition is recursive. Another way to say it is that an object is live if it is _reachable_ from the roots. If you run depth-first search starting at all the roots, you will visit all the live objects and none of the garbage. And that's exactly what garbage collectors do: run depth-first search from the roots. Each object has a small tag that allows the garbage collector to mark whether the object has been visited or not. The tag is invisible to your Java program, but it takes a few bits of the object's memory. (This is not the only "hidden" memory Java associates with an object--for example, every object has a hidden label that tells Java what class it's in.) Memory Addresses ---------------- In any modern computer, memory is one huge array of bytes with addresses. ---------------------------- | | | | | | | ... ---------------------------- 0 1 2 3 4 5 6 However, modern computers prefer to read or write four bytes at a time, and they do this much faster if the four bytes start at an address divisible by four, so that's how things like ints and floats are stored. Every time you declare a local variable, you are naming a memory location. (You pick the name, Java picks the address.) An assignment statement writes something into a memory location. bob ----------------------------- int bob; ... | | | 3 | | | | ... ----------------------------- bob = 3; 208 212 216 220 224 228 Computers can store memory addresses in memory. To reduce the number of syllables, memory addresses are called _pointers_ for short. Some languages like C allow you to declare variables that are pointers. A pointer field is a memory location that points to another memory location. bob ptr ----------------------------- ... | | | 3 | |216| | ... --------------------+-------- 208 212 216 220 | 228 ^ | | | \-------/ Java uses pointers too, but it considers them confidential information, and won't let you print them or look at the numbers directly. Java references are a little bit like pointers, but as we'll learn soon, they're actually more complicated than pointers. The important point is that your computer's memory is just one giant array that has no structure except the structure you impose on it. Java saves you a huge amount of time and effort by structuring memory for you. Java does this by using hidden pointer-based data structures that you can't access from a Java program. Mark and Sweep Garbage Collection --------------------------------- A mark-and-sweep garbage collector runs in two separate phases. The _mark_ phase does a DFS from every root, and marks all the live objects. The _sweep_ phase does a pass over all the objects in memory. Each object that was not marked as being live is garbage, and its memory is reclaimed. How does the sweep phase do a pass over all the objects in memory? The JVM has an elaborate internal data structure for managing the heap. This hidden data structure keeps track of free memory and allocated memory so that new objects can be allocated without overwriting live ones. Time prevents my describing Java's heap data structure here, but it allows the garbage collector to do a pass over every object, even the ones that are not live. It's roughly like an invisible linked list that links _everything_. Similarly, the stack frames on the stack are data structures that make it possible for the garbage collector to determine which data on the stack are references, and which are not. When a mark-and-sweep collector runs, your program stops running for an instant while the garbage collector does a mark pass and a sweep pass. The garbage collector is typically started when the JVM tries to create a new object but doesn't have enough memory for it. Compaction ---------- Typical programs allocate and forget a good many objects. One problem that arises is _fragmentation_, the tendency of the free memory to get broken up into lots of small pieces. Fragmentation can render Java unable to allocate a large object despite having lots of free memory available. (Fragmentation also means that the memory caches and virtual memory don't perform as well. If you don't know why, wait until CS 61C or CS 152.) To solve this problem, a compacting garbage collector actually picks up the objects and moves them to different locations in memory, thereby removing the space between the objects. This is easily done during the sweep phase. ------------------------------------- ------------------------------------- |object object object object | => |objectobjectobjectobject | ------------------------------------- ------------------------------------- References ---------- There's a problem here: if we pick up an object and move it, what about all the references to that object? Aren't those references wrong now? Interestingly, in the Oracle JVM, a reference isn't a pointer. A reference is a handle. A _handle_ is a pointer to a pointer. When an object moves, Java corrects the second pointer so it points to the object's new address. That way, even if there are a million references to the object, they're all corrected in one fell swoop. The "second pointers" are kept in a special table, since they don't take as much memory as objects. reference reference reference reference reference reference | | | | | | | v | | v | \----->====<------/ \----->====<------/ /---+- | ==> | -+----\ | ==== ==== | v v object object "Over here" "No, wait, over here" The special table of "second pointers" does not suffer from fragmentation because all pointers have exactly the same size. Objects suffer from fragmentation because when a small object is garbage collected, the space it leaves behind might not be large enough to accommodate a larger object. But a garbage-collected object's "second pointer" can simply be reused by any newly constructed object that comes along, because all "second pointers" have the same size. Copying Garbage Collection -------------------------- Copying garbage collection is an alternative to mark and sweep. It does compaction, but it is faster than mark and sweep with compaction because there is only one phase, rather than a mark phase and a sweep phase. Memory is divided into two distinct spaces, called the old space and the new space. A copying garbage collector finds the live objects by DFS as usual, but when it encounters an object in the old space, it _immediately_ moves it to the new space. The object is moved to the first available memory location in the new space, so compaction is part of the deal. After all the objects are moved to the new space, the garbage objects that remain in the old space are simply forgotten. There is no need for a sweep phase. Next time the garbage collector runs, the new space is relabeled the "old space" and the old space is relabeled the "new space". Long-lived objects may be copied back and forth between the two spaces many times. While your program is running (between garbage collections), all your objects are in one space, while the other space sits empty. The advantage of copying garbage collection is that it's fast. The disadvantage is that you effectively cut in half the amount of heap memory available to your program.