DECISION TREES (continued)
==============

Multivariate splits
-------------------
Find non-axis-aligned splits with other classification algs or by generating
them randomly.
[Show figure where regular decision tree needs many splits to approximate
 a diagonal linear decision boundary (multivariate.pdf).]
[This gives you all the power of classification algorithms such as SVMs,
 logistic regression, and Gaussian discriminant analysis; moreover, it can make
 them more powerful by making them hierarchical, so they're not limited to just
 one boundary.]
May gain better classifier at cost of worse interpretability.
Can limit # of features per split:  forward stepwise selection, Lasso.

Decision Tree Regression
------------------------
Creates a piecewise constant regression fn.
[Show decision tree regression figures (regresstree.pdf, regresstreefn.pdf).]

                        _ 2        _
Cost J(S) = sum   (y  - y) , where y is the mean y  for sample subset S.
           i in S   i                             i

[So if all the points in a node have the same y-value, then the cost is zero.]
[We choose the split that minimizes the weighted average of the costs of the
 children after the split.]

Stopping early
--------------
Why?
- Limit tree depth (for speed)
- Limit tree size (big data sets)
- Complete tree may overfit
- Given noise or overlapping distributions, purity of leaves is
  counterproductive; better to estimate posterior probs
[When you have overlapping class distributions, it's better to estimate
 a posterior probability than to always give a yes/no answer.  Refining
 the tree down to one sample point per leaf is absolutely guaranteed to
 overfit.  Whereas if you have enough points in each leaf, you can estimate
 posterior probabilities.]

[Show figure of multi-class leaves and posterior probability histogram for
 a leaf (leaf.pdf).]

How?  Select stopping condition(s):
- Next split doesn't reduce entropy/error enough (dangerous; pruning better)
- Most of node's points (e.g. > 95%) have same class    [to deal with outliers]
- Node contains few sample points (e.g. < 10)
- Node covers tiny volume
- Depth too great
- Use (cross-)validation to compare.
[The last is the slowest but most effective way to know when to stop:  use
 validation to decide whether splitting the node is a win or a loss.  But if
 your goal is to avoid overfitting, it's generally even more effective to grow
 the tree a little too large and then use validation to prune it back.  We'll
 talk about that next.]

Leaves with multiple points return
- a majority vote or class posterior probs (classification) or
- an average (regression).


Pruning
-------
Grow tree too large; greedily remove each split whose removal improves
cross-validation performance.  More reliable than stopping early.
[We have to do cross-validation once for each split that we're considering
 removing.  But you can do that pretty cheaply.  What you *don't* do is
 reclassify every sample point from scratch.  Instead, you keep track of which
 points in the validation set end up at which leaf.  When you are deciding
 whether to remove a split, you just look at the points in the two leaves
 you're thinking of removing, and see how they will be reclassified and how
 that will change the error rate.  You can do this very quickly.]
[The reason why pruning often works better than stopping early is because
 often a split that doesn't seem to make much progress is followed by a split
 that makes a lot of progress.  If you stop early, you'll never get there.]

[Show chart of tree size vs. error for baseball hitter data and final decision
 tree (prunehitters.pdf; prunedhitters.pdf).  Players' salaries:
 R1 = $165,174, R2 = $402,834, R3 = $845,346.]

ENSEMBLE LEARNING
=================
Decision trees are fast, simple, interpretable, easy to explain,
invariant under scaling/translation, robust to irrelevant features.

But not the best at prediction.  [Compared to previous methods we've seen.]
High variance.

[For example, suppose we take a training data set, split it into two halves,
 and train two decision trees, one on each half of the data.  It's not uncommon
 for the two trees to turn out very different.  In particular, if the two trees
 pick different features for the very first split at the top of the tree, then
 it's quite common for the trees to be completely different.  So decision trees
 tend to have high variance.]
[So let's think about how to fix this.  As an analogy, imagine that you are
 generating random numbers from some distribution.  If you generate just one
 number, it might have high variance.  But if you generate n numbers and take
 their average, then the variance of that average is n times smaller.  So you
 might ask yourself, can we reduce the variance of decision trees by taking an
 average answer of a bunch of decision trees?  Yes we can.]

[Show book cover "The Wisdom of Crowds" and Penelope the cow (wisdom.jpg,
 penelope.jpg).]

[A 1906 county fair in Plymouth, England had a contest to guess the weight of
 an ox.  A scientist named Francis Galton was there, and he did an experiment.
 He calculated the median of everyone's guesses.  The median guess was 1,207
 pounds, and the true weight was 1,198 pounds, so the error was less than 1%.
 Even the cattle experts present didn't estimate it that accurately.]
[NPR repeated the experiment in 2015 with a cow named Penelope whose photo they
 published online.  They got 17,000 guesses, and the average guess was 1,287
 pounds.  Penelope's actual weight was 1,355 pounds, so the crowd got it to
 within 5 percent.]
[The main idea is that sometimes the average opinion of a bunch of idiots is
 better than the opinion of one expert.  And so it is with learning algorithms.
 We call a learning algorithm a "weak learner" if it does better than guessing
 randomly.  And we combine a bunch of weak learners to get a strong one.]
[Incidentally, James Surowiecki, the author of the book, guessed 725 pounds for
 Penelope.  So he was off by 87%.  He's like a bad decision tree who wrote
 a book about how to average decision trees.]


We can take average of output of:
- Different learning algs
- Same learning alg on many training sets             [if we have tons of data]
- _Bagging_:  Same learning alg on many random subsamples of one training set.
- _Random_forests_:  Randomized decision trees on random subsamples.
[These last two are the most common ways to use averaging, because usually we
 don't have enough training data to use fresh data for every learner.]

[Averaging is not specific to decision trees; it can work with many different
 learning algorithms.  But it works particularly well with decision trees.]

Regression algs:  take median or mean output
Classification algs:  take majority vote OR average posterior probs

[Show averageaxis.mov]  [Here's a simple classifier that takes an average of
                         "stumps", trees of depth 1.  Observe how good the
                         posterior probabilities look.]]
[Show averageaxistree.mov]  [Here's a 4-class classifier with depth-2 trees.]

[The Netflix Prize was an open competition for the best collaborative filtering
 algorithm to predict user ratings for films, based on previous ratings.  It
 ran for three years and ended in 2009.  The winners used an extreme ensemble
 method that took an average of many different learning algorithms.  In fact, a
 couple of top teams combined into one team so they could combine their
 methods.  They said, "Let's average our models and split the money," and
 that's what happened.]

Use learners with low bias (e.g. deep decision trees).
High variance & some overfitting is okay.  Averaging reduces the variance!
[Each learner may overfit, but each overfits in its own unique way.]
Averaging sometimes reduces the bias & increases flexibility;
  e.g. creating nonlinear decision boundary from linear classifiers.
Hyperparameter settings usually different than 1 learner.
[Because averaging learners reduces their variance.  But averaging rarely
 reduces bias as much as it reduces variance, so you want to get the bias nice
 and small before you average.]
# of trees is another hyperparameter.

Bagging = _Bootstrap_AGGregatING_ (Leo Breiman, 1994)
-------
[Leo Breiman was a statistics professor right here at Berkeley.  He did his
 best work after he retired in 1993.  The bagging algorithm was published the
 following year, and then he went on to co-invent random forests as well.
 Unfortunately, he died in 2005.]
[Show photo (breiman.gif).]
[Bagging is a randomized method for creating many different learners from the
 same data set.  It works well with many different learning algorithms.
 One exception seems to be k-nearest neighbors; bagging mildly degrades it.]
Given n-point training sample, generate random subsample of size n' by sampling
*with*replacement*.  Some points chosen multiple times; some not chosen.

                                1 3 4 6 8 9
                              /             \
                              v             v
                       6 3 6 1 1 9       8 4 6 9 1 8

If n' = n, ~ 63.2% are chosen.  [On average; this fraction varies randomly.]
Build learner.  Points chosen j times have greater weight:
[If a point is chosen j times, we want to treat it the same way we would treat
 j different points all bunched up infinitesimally close together.]
- Decision trees:  j-time point has j x weight in entropy.
- SVMs:  j-time point incurs j x penalty to violate margin.
- Regression:  j-time point incurs j x loss.

Repeat until T learners.  Metalearner takes test point, feeds it into
all T learners, returns average/majority output.

Random Forests
--------------
Random sampling isn't random enough!
[With bagging, often the decision trees look very similar.  Why is that?]
One really strong predictor -> same feature split at top of every tree.
[For example, if you're building decision trees to identify spam, the first
 split might always be "viagra".  Random sampling doesn't change that.  If the
 trees are very similar, then taking their average doesn't reduce the variance
 much.]
Idea:  At each split, take random sample of m features (out of d).
       Choose best split from m features.
       [We're not allowed to split on the other d - m features!]
       Different random sample for each split.
       m ~ sqrt(d) works well for classification; m ~ d/3 for regression.
       [So if you have 100-dimensional feature space, you randomly choose 10
        features and pick the one of those 10 that gives the best split.
        But m is a hyperparameter, and you might get better results by tuning
        it for your particular application.]
       Smaller m -> more randomness, less tree correlation, more bias
[One reason this works is if there's a really strong predictor, only a fraction
 of the trees can choose that predictor as the first split.  That fraction is
 m/d.  So the split tends to "decorrelate" the trees.  And that means that when
 you take the average of the trees, you'll have less variance.]
[You have to be careful, though, because you don't want to dumb down the trees
 too much in your quest for decorrelation.  Averaging works best when you have
 very strong learners that are also diverse.  But it's hard to create a lot of
 learners that are very different yet all very smart.  The Netflix Prize
 winners did it, but it was a huge amount of work.]

Sometimes test error reduction up to 100s or even 1,000s of decision trees!
Disadvantage:  loses interpretability/inference.
[But the compensation is it's more accurate than a single decision tree.]

Variation:  generate m random multivariate splits (oblique lines, quadrics);
choose best split.
[You have to be a bit clever about how you generate random decision boundaries;
 I'm not going to discuss that today.]

[Show treesidesdeep.mov]  [Lots of good-enough conic random decision trees.]
[Show averageline.mov]
[Show averageconic.mov]
[Show square.mov]  [Depth 2; look how good the posterior probabilities look.]
[Show squaresmall.mov]  [Depth 2; see the uncertainty away from the center.]
[Show spiral2.mov]  [Doesn't look like a decision tree at all, does it?]
[Show overlapdepth14.mov]  [Overlapping classes.  This example overfits!]
[Show overlapdepth5.mov]  [Better fit.]

[Show spiral at depths 4 & 12, axis/line/conic (500.pdf).]
[Show spiral w/axis-aligned tree at depths 4 & 12, with 1, 5, or 50 random
 multivariate splits (randomness.pdf).]