NEURAL NETWORKS
===============
Can do both classification and regression.

[Ties together several ideas from the course:  perceptrons, logistic
 regression, ensembles of learners, and stochastic gradient descent.
 It also ties in the idea of lifting samples to a higher-dimensional feature
 space, but with a new twist:  neural nets can learn features themselves.]

[I want to begin by reminding you of the story I told you at the beginning of
 the semester, about Frank Rosenblatt's invention of perceptrons in 1957.
 Remember that he held a press conference where he predicted that perceptrons
 would be "the embryo of an electronic computer that [the Navy] expects will be
 able to walk, talk, see, write, reproduce itself and be conscious of its
 existence."]
[Perceptron research continued, but something monumental happened in 1969.
 Marvin Minsky, one of the founding fathers of AI, and Seymour Papert published
 a book called "Perceptrons".  Sounds good, right?  Well, part of the book was
 devoted to things perceptrons can't do.  And one of those things is XOR.]

                                      x
                                       1

                           XOR  |  0  |  1  
                         -------+-----+-----
                            0   |  0     1
                     x   -------+
                      2     1   |  1     0

[Think of the four outputs here as sample points in two-dimensional space.
 Two of them are in class 1, and two of them are in class 0.  We want to find
 a linear classifier that separates the 1's from the 0's.  Can we do it?  No.]
[So Minsky and Papert were basically saying, "Frank.  You're telling us this
 machine is going to be conscious of its own existence but it can't do XOR?"]
[The book had a devastating effect on the field.  After its publication, almost
 no research was done on neural net-like ideas for a decade, a time we now call
 the "AI Winter".  Shortly after the book was published, Frank Rosenblatt died.
 Officially, he died in a boating accident.  But we all know he died of
 a broken heart.]

[One thing I don't understand is why nobody at the time pointed out some almost
 obvious ways to get around this problem.  Here's the easiest.]

If you add one new quadratic feature, x_1 x_2, XOR is linearly separable in 3D.

                               0----------1
                              /|         /|
                             / |        / |
                            +----------+  |
                            |  |       |  |
                            |  |       |  |
                            |  1-------|--+
                            | /        | /
                            |/         |/
                            +----------0

[Now we can find a plane that cuts through the cube obliquely and separates the
 0's from the 1's.]

[However, there's an even more powerful way to do XOR.  The idea is to design
 linear classifiers whose output is the input to other linear classifers.
 That way, you should be able to emulate arbitrarily logical circuits.
 Suppose I put together some linear predictor functions like this.]

                 ------------------
    x  --------->|  linear combo  |--
     1   ------->------------------  \    ------------------
       \/                             --->|  linear combo  |--> z
       /\                             --->------------------
      /  ------->------------------  /
    x ---------->|  linear combo  |--
     2           ------------------

[If I interpret the output as 1 if z is positive or 0 is z is negative, can
 I do XOR with this?]

A linear combo of a linear combo is a linear combo...only works for linearly
separable samples.

[We need one more idea to make neural nets.  We need to add some sort of
 nonlinearity between the linear combinations.  Let's call these boxes that
 compute linear combinations "neurons".  If a neuron runs the linear
 combination it computes through some nonlinear function before sending it on
 to other neurons, then the neurons can act somewhat like logic gates.  The
 nonlinearity could be as simple as clamping the output so it can't go below
 zero.  And that's actually used in practice sometimes.]
[The most popular traditional choice has been to use the logistic function.
 The logistic function can't go below zero or above one, which is nice because
 it can't ever get huge and oversaturate the other neurons it's sending
 information to.  The logistic function is also smooth, which means it has
 well-defined gradients we can use in gradient descent.]
[With logistic functions between the linear combinations, here's a two-level
 perceptron that computes the XOR function.  Note that the logistic function
 at the output is optional; we could just take the sign of the output instead.]

                         NAND
                 --------------------\    v
    x  --------->| s(30 - 20x - 20y)  )O--             AND
         ------->--------------------/    \    -------------------\
       \/                                  --->| s(20v + 20w - 30) )-> x XOR y
       /\                                  --->-------------------/
      /  ------>\--------------------\    /
    y ---------> ) s(20x + 20y - 10)  >---
                /--------------------/    w
                          OR

Network with 1 Hidden Layer
---------------------------
[Note:  see the hand-drawn version of this section with an illustration.]

Input layer:  x , ..., x  ; x    = 1
               1        d    d+1

Hidden units:  h , ..., h  ; h    = 1
                1        m    m+1

Output layer:  a , ..., z
                1        k

[We might have more than one output so that we can build multiple classifiers
 that share hidden units.  One of the interesting advantages of neural nets is
 that if you train multiple classifiers simultaneously, sometimes some of them
 come out better because they can take advantage of particularly useful hidden
 units that first emerged to support one of the other classifiers.]

Layer 1 weights:  m x (d+1) matrix V            V_i is row i
Layer 2 weights:  k x (m+1) matrix W            W_i is row i

                                   1
Recall logistic fn s(gamma) = -----------.  Other nonlinear fns can be used.
                                   -gamma
                              1 + e

                     -        -
                     | s(v_1) |
                     |        |
For vector v, s(v) = | s(v_2) |.       [We apply s to a vector component-wise.]
                     |   .    |
                     |   .    |
                     -   .    -

[At this point, see hand-drawn page for depiction of neural net with one
 hidden layer.]

          3
  h  = s(sum V   x )               In short, h = s(Vx)
   i     i=1  ij  j

  z = s(Wh) = s(W s (Vx))
                   1
                   ^
                  add a 1 to end of vector

[We can add more hidden layers, and for some tasks it's common to have up to
 five hidden layers.  There are many variations you can experiment with--for
 instance, you can have connections that go forward more than one layer.]

Training
--------
Usually stochastic or batch gradient descent.

Pick loss fn L(z, y)              e.g. L(z, y) = |z - y|^2
               ^  ^
     predictions  true values (could be a vector)

                   n
Cost fn is J(h) = sum L(h(X ), Y )
                  i=1      i    i

[I'm using a capital Y here because now Y is a matrix with one row for each
 sample point and one column for each output of the neural net.  Often there
 will be just one output, but many neural net applications have more.]
Usually there are many local minima!
[The cost function for a neural net is, generally, not even close to convex.
 For that reason, it's possible to wind up in a bad minimum.  We'll talk later
 about some clever ways to coax neural nets into better minima.]
[Now let me ask you this.  Suppose we start by setting all the weights to zero,
 and then we do gradient descent on the weights.  What will go wrong?]
[The gradient descent has no way to break symmetry, you can get stuck in
 a situation where all the weights in each layer have the same value, and they
 have no way to become different from each other.  To avoid this problem, and
 in the hopes of finding a better minimum, we start with random weights.]

Rewrite all the weights in V & W as a vector w.  Batch gradient descent:

  w <- vector of random weights
  repeat
    w <- w - epsilon grad J(w)

[It's important to make sure the random weights aren't too big, because if
 a unit's output gets too close to zero or one, it can get "stuck", meaning
 that it always has roughly the same output value regardless of the input.
 Stuck units tend to stay stuck because in that operating range, the gradient
 s'(.) of the logistic function is close to zero.]
[We can instead use stochastic gradient descent, which means we use the
 gradient of one sample point's loss function at each step.  Typically, we
 shuffle the points in a random order, or just pick one randomly at each step.]
[The hard part of this algorithm is computing the gradient.  If you simply
 derive one derivative for each weight, you'll find that it takes time linear
 in the size of the neural network to compute a derivative for one weight in
 the first layer.  Multiply that by the number of weights.  We're going to
 spend some time learning to improve the running time to linear in the number
 of weights.]

Naive gradient computation:  O(units x edges) time
Backpropagation:  O(edges) time

Computing Gradients for Arithmetic Expressions
----------------------------------------------
[See the two hand-drawn pages.]

The Backpropagation Alg
-----------------------
[See the hand-drawn page.]