                              CS 61B: Lecture 27
                           Wednesday, April 2, 2014

2-3-4 TREES
===========
Last lecture, we learned about the Ordered Dictionary ADT, and we learned one
data structure for implementing it:  binary search trees.  Today we learn
a faster one.

A 2-3-4 tree is a perfectly balanced tree.  It has a big advantage over regular
binary search trees:  because the tree is perfectly balanced, find, insert, and
remove operations take O(log n) time, even in the worst case.

2-3-4 trees are thus named because every node has 2, 3, or 4 children, except
leaves, which are all at the bottom level of the tree.  Each node stores 1, 2,
or 3 entries, which determine how other entries are distributed among its
children's subtrees.

Each internal (non-leaf) node has one more child than entries.  For example,
a node with keys [20, 40, 50] has four children.  Each key k in the subtree
rooted at the first child satisfies k <= 20; at the second child,
20 <= k <= 40; at the third child, 40 <= k <= 50; and at the fourth child,
k >= 50.
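
The key ranges above amount to a simple rule for choosing a child.  Here is a
minimal sketch, assuming int keys rather than the Object keys of the ADT.  The
names NodeLayoutSketch, Node, and childIndex are made up for illustration and
are not part of the course code.

```java
import java.util.ArrayList;
import java.util.List;

class NodeLayoutSketch {
    /** A 2-3-4 node: 1 to 3 sorted keys; an internal node has keys.size() + 1 children. */
    static class Node {
        final List<Integer> keys = new ArrayList<>();
        final List<Node> children = new ArrayList<>();   // empty for a leaf

        boolean isLeaf() { return children.isEmpty(); }
    }

    /** Index of the child whose subtree may hold k: the number of keys strictly less than k. */
    static int childIndex(Node node, int k) {
        int i = 0;
        while (i < node.keys.size() && k > node.keys.get(i)) {
            i++;
        }
        return i;
    }
}
```

For a node with keys [20, 40, 50], childIndex returns 0 for k = 10, 1 for
k = 30, 2 for k = 45, and 3 for k = 60.  A key equal to one in the node (say,
k = 40) is sent to the child on that key's left, which is one possible way to
break the ties that the <= signs allow.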

WARNING:  The algorithms for insertion and deletion I'll discuss today are
different from those discussed by Goodrich and Tamassia.  The text presents
"bottom-up" 2-3-4 trees, so named because the effects of node splits at the
bottom of the tree can work their way back up toward the root.  I'll discuss
"top-down" 2-3-4 trees, in which insertion and deletion finish at the leaves.
Top-down 2-3-4 trees are usually faster than bottom-up ones, because both trees
search down from the root to the leaves, but only the bottom-up trees sometimes
go back up to the root.  Goodrich and Tamassia call 2-3-4 trees "(2, 4) trees".

2-3-4 trees are a type of B-tree, which you may learn about someday in
connection with fast disk access for database systems.  B-trees on disks
usually use the top-down insertion/deletion algorithms, because accessing
a disk track is slow, so you'd rather not revisit it multiple times.

[1]  Entry find(Object k);

Finding an entry is straightforward.        ==========
Start at the root.  At each node,           +20 40 50+
check for the key k; if it's not         /--==========------\
present, move down to the           /---/      /  \          \-----\
appropriate child chosen by     ----      ----      ----            =======
comparing k against the keys.   |14|      |32|      |43|            +70 79+
Continue until k is found,      ----      ----      ----            =======
or k is not found at a          /  \      /  \      /  \            /  |  \
leaf node.  For example,     ---- ---- ---- ---- ---- ---- ---------- ==== ----
find(74) visits the          |10| |18| |25| |33| |42| |47| |57 62 66| +74+ |81|
double-lined boxes at right. ---- ---- ---- ---- ---- ---- ---------- ==== ----
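
The search just described might be sketched as follows.  This is an
illustration under simplifying assumptions, not the course implementation:
keys are ints, find() returns the node holding k instead of an Entry, and the
names FindSketch, Node, keys, and figureTree are invented here.

```java
class FindSketch {
    /** A 2-3-4 node with sorted keys; children is empty for a leaf. */
    static class Node {
        final int[] keys;
        final Node[] children;

        Node(int[] keys, Node... children) {
            this.keys = keys;
            this.children = children;
        }
    }

    /** Returns the node that contains k, or null if k is not in the tree. */
    static Node find(Node root, int k) {
        Node node = root;
        while (node != null) {
            int i = 0;
            while (i < node.keys.length && k > node.keys[i]) {
                i++;                              // skip keys smaller than k
            }
            if (i < node.keys.length && node.keys[i] == k) {
                return node;                      // k is in this node
            }
            if (node.children.length == 0) {
                return null;                      // hit a leaf without finding k
            }
            node = node.children[i];              // descend into the chosen child
        }
        return null;                              // empty tree
    }

    static int[] keys(int... ks) { return ks; }

    /** Builds the tree drawn in the figure above. */
    static Node figureTree() {
        return new Node(keys(20, 40, 50),
            new Node(keys(14), new Node(keys(10)), new Node(keys(18))),
            new Node(keys(32), new Node(keys(25)), new Node(keys(33))),
            new Node(keys(43), new Node(keys(42)), new Node(keys(47))),
            new Node(keys(70, 79),
                new Node(keys(57, 62, 66)), new Node(keys(74)), new Node(keys(81))));
    }
}
```

Calling FindSketch.find(FindSketch.figureTree(), 74) walks root, "70 79", "74",
the same three nodes the figure highlights.  Searching for 60 ends at the leaf
"57 62 66" and returns null.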

Incidentally, you can define an inorder traversal on 2-3-4 trees analogous to
that on binary trees, and it visits the keys in sorted order.
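
One way to write such a traversal is sketched below, again assuming int keys.
The class and method names (InorderSketch, Node, inorder) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class InorderSketch {
    /** A 2-3-4 node with sorted keys; children is empty for a leaf. */
    static class Node {
        final int[] keys;
        final Node[] children;

        Node(int[] keys, Node... children) {
            this.keys = keys;
            this.children = children;
        }
    }

    /** Appends every key in the subtree rooted at node to out, in sorted order. */
    static void inorder(Node node, List<Integer> out) {
        boolean leaf = node.children.length == 0;
        for (int i = 0; i < node.keys.length; i++) {
            if (!leaf) {
                inorder(node.children[i], out);   // subtree to the left of key i
            }
            out.add(node.keys[i]);
        }
        if (!leaf) {
            inorder(node.children[node.keys.length], out);   // rightmost subtree
        }
    }

    /** Convenience wrapper that returns the keys as a new list. */
    static List<Integer> inorder(Node root) {
        List<Integer> out = new ArrayList<>();
        inorder(root, out);
        return out;
    }
}
```

The pattern is child, key, child, key, ..., child, which reduces to the
familiar left-root-right order when a node has just one key.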

[2]  Entry insert(Object k, Object e);

insert(), like find(), walks down the tree in search of the key k.  If it finds
an entry with key k, it proceeds to that entry's "left child" and continues.

Unlike find(), insert() sometimes modifies             ----         -------
nodes of the tree as it walks down.                    |20|         |11 20|
Specifically, whenever insert() encounters             ----         -------
a 3-key node, the middle key is ejected,               /  \   =>    /  |  \
and is placed in the parent node instead.     ========== ----    ---- ---- ----
Since the parent was previously treated the   +10 11 12+ |30|    |10| |12| |30|
same way, the parent has at most two keys,    ========== ----    ---- ---- ----
and always has room for a third.  The other
two keys in the 3-key node are split into two separate 1-key nodes, which are
divided underneath the old middle key (as the figure illustrates).

For example, suppose we                      ----                              
insert 60 into the tree                      |40|                              
depicted in [1].  The                      /------\                            
first node visited is                 /---/        \----\                      
the root, which has three          ----                  ----                  
keys; so we kick the               |20|                  |50|                  
middle key (40) upstairs.          ----                /------\                
Since the root node has           /    \              /        \               
no parent, a new node         ----      ----      ----          ----------     
is created to hold 40         |14|      |32|      |43|          |62 70 79|     
and becomes the root.         ----      ----      ----          ----------     
Similarly, 62 is kicked       /  \      /  \      /  \          /  |  |   \    
upstairs when insert()     ---- ---- ---- ---- ---- ---- ------- ---- ---- ----
finds the node containing  |10| |18| |25| |33| |42| |47| |57 60| |66| |74| |81|
it.  This ensures that     ---- ---- ---- ---- ---- ---- ------- ---- ---- ----
when we arrive at the leaf
(labeled 57 in this example), there's room to add the new key 60.

Observe that along the way, we created a new 3-key node "62 70 79".  We do not
kick its middle key upstairs until the next time it is visited.

Again, the reasons why we split every 3-key node we encounter (and move its
middle key up one level) are (1) to make sure there's room for the new key in
the leaf node, and (2) to make sure that above the leaves, there's room for any
key that gets kicked upstairs.  Sometimes, an insertion operation increases the
height of the tree by one by creating a new root.
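
To make the top-down procedure concrete, here is a rough sketch of insert()
that splits every 3-key node on the way down.  It is one possible rendering,
not the course code: it stores bare int keys instead of entries, and the names
InsertSketch, Node, childIndex, and split are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;

class InsertSketch {
    static class Node {
        final List<Integer> keys = new ArrayList<>();
        final List<Node> children = new ArrayList<>();   // empty for a leaf

        boolean isLeaf() { return children.isEmpty(); }
    }

    Node root;

    /** Number of keys in node that are strictly less than k (ties go left). */
    static int childIndex(Node node, int k) {
        int i = 0;
        while (i < node.keys.size() && k > node.keys.get(i)) {
            i++;
        }
        return i;
    }

    void insert(int k) {
        if (root == null) {
            root = new Node();
            root.keys.add(k);
            return;
        }
        Node parent = null;
        Node node = root;
        while (true) {
            if (node.keys.size() == 3) {
                parent = split(parent, node);     // middle key moves up
                node = parent.children.get(childIndex(parent, k));   // pick a half
            }
            if (node.isLeaf()) {
                node.keys.add(childIndex(node, k), k);   // room is guaranteed
                return;
            }
            parent = node;
            node = node.children.get(childIndex(node, k));
        }
    }

    /** Splits a 3-key node; returns the node that now holds its middle key. */
    private Node split(Node parent, Node node) {
        Node right = new Node();
        right.keys.add(node.keys.remove(2));      // largest key starts the right half
        int middle = node.keys.remove(1);         // node keeps only its smallest key
        if (!node.isLeaf()) {
            right.children.add(node.children.remove(2));
            right.children.add(node.children.remove(2));
        }
        if (parent == null) {                     // splitting the root: grow the tree
            parent = new Node();
            parent.children.add(node);
            root = parent;
        }
        int i = parent.children.indexOf(node);
        parent.keys.add(i, middle);
        parent.children.add(i + 1, right);
        return parent;
    }
}
```

Inserting 10, 20, 30, 40, 50, 60, 70 in that order with this sketch leaves a
root "20 40" above the three leaves "10", "30", and "50 60 70".  The root was
split once (when 40 arrived), which is where the tree gained its second level.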

[3]  Entry remove(Object k);

2-3-4 tree remove() is similar to remove() on binary search trees:  you find
the entry you want to remove (having key k).  If it's in a leaf, you remove it.
If it's in an internal node, you replace it with the entry with the next higher
key.  That entry is always in a leaf.  In either case, you remove an entry from
a leaf in the end.

Like insert(), remove() changes nodes of the tree as it walks down.  Whereas
insert() eliminates 3-key nodes (moving keys up the tree) to make room for new
keys, remove() eliminates 1-key nodes (pulling keys down the tree) so that a
key can be removed from a leaf without leaving it empty.  There are three ways
1-key nodes (except the root) are eliminated.

(1)  When remove() encounters a 1-key  -------                  -------        
node (except the root), it tries       |20 40|                  |20 50|        
to steal a key from an adjacent        -------                  -------        
sibling.  But we can't just steal      /  |  \          =>     /   |   \       
the sibling's key without          ---- ==== ----------    ---- ------- -------
violating the search tree          |10| +30+ |50 51 52|    |10| |30 40| |51 52|
invariant.  This figure shows      ---- ==== ----------    ---- ------- -------
remove's action, called a           /\   /\   / |  | \      /\   / | \   / | \ 
"rotation", when it reaches "30".            S                        S        
We move a key from the sibling to
the parent, and we move a key from the parent to the 1-key node.  We also move
a subtree S from the sibling to the 1-key node (now a 2-key node).

Goodrich & Tamassia call rotations "transfer" operations.  Note that we can't
steal a key from a non-adjacent sibling.

(2)  If no adjacent sibling has more than one     -------               ----   
key, a rotation can't be used.  In this case,     |20 40|               |40|   
the 1-key node steals a key from its parent.      -------               ----   
Since the parent was previously treated the       /  |  \    =>         /  \   
same way (unless it's the root), it has at    ==== ---- ----    ---------- ----
least two keys, and can spare one.  The       +10+ |30| |50|    |10 20 30| |50|
sibling is also absorbed, and the 1-key node  ==== ---- ----    ---------- ----
becomes a 3-key node.  The figure illustrates
remove's action when it reaches "10".  This is called a "fusion" operation.

(3)  If the parent is the root and contains only one key, and the sibling
contains only one key, then the current 1-key node, its 1-key sibling, and the
1-key root are fused into one 3-key node that serves as the new root.  The
height of the tree decreases by one.
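
The three cases can be sketched in code as follows.  This covers only the step
that eliminates a 1-key node, not a complete remove(), and it makes several
assumptions: int keys, list-based nodes, and invented names (RemoveSketch,
ensureTwoKeys, rotateFromRight, rotateFromLeft, fuse).  Trying the right
sibling before the left one is an arbitrary choice of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RemoveSketch {
    static class Node {
        final List<Integer> keys = new ArrayList<>();
        final List<Node> children = new ArrayList<>();   // empty for a leaf

        Node(Integer... ks) { keys.addAll(Arrays.asList(ks)); }
        boolean isLeaf() { return children.isEmpty(); }
    }

    /** Makes sure child i of parent has at least two keys; returns that child. */
    static Node ensureTwoKeys(Node parent, int i) {
        Node node = parent.children.get(i);
        if (node.keys.size() > 1) {
            return node;
        }
        int last = parent.children.size() - 1;
        if (i < last && parent.children.get(i + 1).keys.size() > 1) {
            rotateFromRight(parent, i);           // case (1), right sibling
            return node;
        }
        if (i > 0 && parent.children.get(i - 1).keys.size() > 1) {
            rotateFromLeft(parent, i);            // case (1), left sibling
            return node;
        }
        // Cases (2) and (3): fuse with an adjacent 1-key sibling.  If parent was
        // a 1-key root it is now empty, and the caller should make the returned
        // node the new root.
        return fuse(parent, i < last ? i : i - 1);
    }

    static void rotateFromRight(Node parent, int i) {
        Node node = parent.children.get(i);
        Node sibling = parent.children.get(i + 1);
        node.keys.add(parent.keys.get(i));               // parent key comes down
        parent.keys.set(i, sibling.keys.remove(0));      // sibling's smallest goes up
        if (!sibling.isLeaf()) {
            node.children.add(sibling.children.remove(0));   // subtree S moves over
        }
    }

    static void rotateFromLeft(Node parent, int i) {
        Node node = parent.children.get(i);
        Node sibling = parent.children.get(i - 1);
        node.keys.add(0, parent.keys.get(i - 1));
        parent.keys.set(i - 1, sibling.keys.remove(sibling.keys.size() - 1));
        if (!sibling.isLeaf()) {
            node.children.add(0, sibling.children.remove(sibling.children.size() - 1));
        }
    }

    /** Merges child i, parent key i, and child i + 1 into one 3-key node. */
    static Node fuse(Node parent, int i) {
        Node left = parent.children.get(i);
        Node right = parent.children.remove(i + 1);
        left.keys.add(parent.keys.remove(i));
        left.keys.addAll(right.keys);
        left.children.addAll(right.children);
        return left;
    }
}
```

On the rotation figure (parent "20 40" over "10", "30", "50 51 52"),
ensureTwoKeys(parent, 1) leaves parent "20 50" over "10", "30 40", "51 52".  On
the fusion figure (parent "20 40" over "10", "30", "50"), ensureTwoKeys(parent,
0) leaves parent "40" over "10 20 30" and "50".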

Eventually we reach a leaf.  After we process the leaf, it has at least two
keys (if there are at least two keys in the tree), so we can delete the key
and still have one key in the leaf.

For example, suppose we                  ----------                            
remove 40 from the large                 |20 xx 50|                            
tree depicted in [2].  The            /-----------------\                      
root node contains 40,            /--/      /   \        \-----\               
which we mark "xx" to         ----      ----      ----          ----------     
remind us that we plan to     |14|      |32|      |43|          |62 70 79|     
replace it with the           ----      ----      ----          ----------     
smallest key in the root      /  \      /  \      /  \          /  |  |   \    
node's right subtree.  To  ---- ---- ---- ---- ---- ---- ------- ---- ---- ----
find that key, we move on  |10| |18| |25| |33| |42| |47| |57 60| |66| |74| |81|
to the 1-key node labeled  ---- ---- ---- ---- ---- ---- ------- ---- ---- ----
50.  Following our rules
for 1-key nodes, we fuse 50 with its sibling and parent to create a new 3-key
root labeled "20 xx 50".

Next, we visit the node                     ----------
labeled 43.  Again                          |20 xx 62|
following our rules for                 /--------------------\
1-key nodes, we rotate            /----/    /       \         \-----\
62 from a sibling to the      ----      ----      -------            -------
root, and move 50 from        |14|      |32|      |43 50|            |70 79|
the root to the node          ----      ----      -------            -------
containing 43.                /  \      /  \     /   |   \           /  |  \
                           ---- ---- ---- ---- ---- ---- ------- ---- ---- ----
                           |10| |18| |25| |33| |42| |47| |57 60| |66| |74| |81|
                           ---- ---- ---- ---- ---- ---- ------- ---- ---- ----

Finally, we move down to                    ----------                         
the node labeled 42.  A                     |20 xx 62|                         
different rule for 1-key               /--------------------\                  
nodes requires us to             /----/        /  \          \-----\           
fuse the nodes labeled       ----      -------/    \------          -------    
42 and 47 into a 3-key       |14|      |32|           |50|          |70 79|    
node, stealing 43 from       ----      ----           ----          -------    
the parent node.             /  \      /  \           /  \          /  |  \    
                          ---- ---- ---- ---- ---------- ------- ---- ---- ----
                          |10| |18| |25| |33| |42 43 47| |57 60| |66| |74| |81|
                          ---- ---- ---- ---- ---------- ------- ---- ---- ----

The last step is to remove 42 from the leaf and replace "xx" with 42.

Running Times
-------------
A 2-3-4 tree with height h has between 2^h and 4^h leaves.  If n is the total
number of entries (including entries in internal nodes), then n >= 2^(h+1) - 1.
By taking the logarithm of both sides, we find that h <= log_2 (n + 1) - 1, so
h is in O(log n).
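
To get a feel for the numbers, the bound can be evaluated directly.  The
helper names below (HeightBoundSketch, minEntries, maxHeight) are made up for
this illustration.  The code just restates n >= 2^(h+1) - 1 and solves it for
h.

```java
class HeightBoundSketch {
    /** Fewest entries in a 2-3-4 tree of height h: every node holds one key. */
    static long minEntries(int h) {
        return (1L << (h + 1)) - 1;
    }

    /** Largest height possible with n >= 1 entries: floor(log2(n + 1)) - 1. */
    static int maxHeight(long n) {
        int floorLog2 = 63 - Long.numberOfLeadingZeros(n + 1);
        return floorLog2 - 1;
    }
}
```

For instance, maxHeight(1000000) is 18, so a 2-3-4 tree holding a million
entries never has more than 19 levels.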

The time spent visiting a 2-3-4 node is typically longer than in a binary
search tree (because the nodes and the rotation and fusion operations are
complicated), but the time per node is still in O(1).

The number of nodes visited is proportional to the height of the tree.  Hence,
the running times of the find(), insert(), and remove() operations are in O(h),
and therefore in O(log n), even in the worst case.

Compare this with the Theta(n) worst-case time of ordinary binary search trees.

Another Approach to Duplicate Keys
----------------------------------
Rather than have a separate node for each entry, we might wish to collect all
the entries that share a common key in one node.  In this case, each node's
entry becomes a list of entries.  This simplifies the implementation of
findAll(), which finds all the entries with a specified key.  It also speeds up
other operations by leaving fewer nodes in the tree data structure.  Obviously,
this is a change in the implementation, but not a change in the dictionary ADT.

This idea can be used with hash tables, binary search trees, and 2-3-4 trees.
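
As a small illustration of the list-per-key idea, the sketch below uses
java.util.TreeMap from the Java standard library as a stand-in for the ordered
dictionary, not a hand-written 2-3-4 tree.  The class name
GroupedDictionarySketch and its methods are hypothetical, and keys are ints
with String values purely for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

class GroupedDictionarySketch {
    // One tree node per distinct key; each holds every value stored under that key.
    private final TreeMap<Integer, List<String>> entries = new TreeMap<>();

    void insert(int key, String value) {
        entries.computeIfAbsent(key, unused -> new ArrayList<>()).add(value);
    }

    /** All values with the given key, in insertion order; empty if there are none. */
    List<String> findAll(int key) {
        List<String> found = entries.get(key);
        if (found == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(found);
    }

    /** Number of nodes the underlying tree needs, i.e. distinct keys. */
    int distinctKeys() {
        return entries.size();
    }
}
```

After insert(5, "a"), insert(5, "b"), and insert(7, "c"), findAll(5) returns
both "a" and "b" from a single lookup, and distinctKeys() is 2 even though
three entries are stored.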