NEURONS
=======

[The field of artificial intelligence started with some wrong premises. The early AI researchers attacked problems like chess and theorem proving, because they thought those exemplified the essence of intelligence. They didn't pay much attention at first to problems like vision and speech understanding. Any four-year-old can do those things, and so researchers underestimated their difficulty. Today, we know better. Computers can effortlessly beat four-year-olds at chess, but they still can't play with toys nearly as well. We've come to realize that rule-based symbol manipulation is not the sole defining mark of intelligence. Even rats do computations that we're hard pressed to match with our computers. We've also come to realize that these are different classes of problems that require very different styles of computation. Brains and computers have very different strengths and weaknesses, which reflect these different computing styles.]

[Neural networks are partly inspired by the workings of actual brains. Let's take a look at a few things we know about biological neurons, and contrast them with both neural nets and traditional computation.]

- CPUs:   largely sequential, nanosecond gates, fragile if gate fails,
          superior for 234 x 718, logical rules, perfect key-based memory

- Brains: very parallel, millisecond neurons, fault-tolerant,

  [Neurons are continually dying. You've probably lost a few since this lecture started. But you probably didn't notice. And that's interesting, because it points out that our memories are stored in our brains in a diffuse representation. There is no one neuron whose death will make you forget that 2 + 2 = 4. Artificial neural nets often share that resilience. Brains and neural nets seem to superpose memories on top of each other, all stored together in the same weights, sort of like a hologram.]

          superior for vision, speech, associative memory

  [By "associative memory", I mean noticing connections between things. One thing our brains are very good at is retrieving a pattern if we specify only a portion of the pattern.]

[It's impressive that even though a neuron needs a few milliseconds to transmit information to the next neurons downstream, we can perform very complex tasks like interpreting a visual scene in a tenth of a second. This is possible because neurons are parallel, but also because of their computation style.]

[Neural nets try to emulate the parallel, associative thinking style of brains, and they are among the best techniques we have for many fuzzy problems, including some problems in vision and speech. Not coincidentally, neural nets are also inferior at many traditional computer tasks such as multiplying numbers with lots of digits or compiling source code.]

[Show figure of neurons (neurons.pdf).]

- _Neuron_:  A cell in brain/nervous system for thinking/communication
- _Action_potential_ or _spike_:  An electrochemical impulse _fired_ by a neuron to communicate w/other neurons
- _Axon_:  The limb(s) along which the action potential propagates; "output"
  [Most axons branch out eventually, sometimes profusely near their ends.]
  [It turns out that giant squids have a very large axon they use for fast water jet propulsion. The mathematics of action potentials was first characterized in these giant squid axons, and that work won a Nobel Prize in 1963.]
- _Dendrite_:  Smaller limbs by which neuron receives info; "input"
- _Synapse_:  Connection from one neuron's axon to another's dendrite
  [Some synapses connect axons to muscles or glands.]
- _Neurotransmitter_:  Chemical released by axon terminal to stimulate dendrite
  [When an action potential reaches an axon terminal, it causes tiny containers of neurotransmitter, called _vesicles_, to empty their contents into the space where the axon terminal meets another neuron's dendrite. That space is called the _synaptic_cleft_. The neurotransmitters bind to receptors on the dendrite and influence the next neuron's body voltage. This sounds incredibly slow, but it all happens in 1 to 5 milliseconds.]

You have about 10^11 neurons, each with about 10^4 synapses. [Maybe 10^5 synapses after you pass CS 189.]

Analogies:  [between artificial neural networks and brains]

- Output of unit <-> firing rate of neuron
  [An action potential is "all or nothing"; they all have the same shape and size. The output of a neuron is not a fixed voltage like the output of a transistor. The output of a neuron is the frequency at which it fires. Some neurons can fire at nearly 1,000 times a second, which you might think of as a strong "1" output. Conversely, some types of neurons can go for minutes without firing. But some types of neurons never stop firing, and for those you might interpret a firing rate of 10 times per second as "0".]
- Weight of connection <-> synapse strength
- Positive weight <-> excitatory neurotransmitter (e.g. glutamate)
- Negative weight <-> inhibitory neurotransmitter (e.g. GABA, glycine)
  [GABA is short for gamma-aminobutyric acid.]
  [A typical neuron is either excitatory at all its axon terminals, or inhibitory at all its terminals. It can't switch from one to the other. Artificial neural nets have an advantage here.]
- Linear combo of inputs <-> _summation_
  [A neuron fires when the sum of its inputs, integrated over time, reaches a high enough voltage. However, the neuron body voltage also decays slowly with time, so if the action potentials coming in are slow enough, the neuron might not fire at all.]
- Logistic/sigmoid fn <-> firing rate saturation
  [A neuron can't fire more than 1,000 times a second, nor less than zero times a second. This limits its ability to be the sole determinant of whether downstream neurons fire. We accomplish the same thing with the sigmoid fn.]
- Weight change/learning <-> _synaptic_plasticity_

  Hebb's rule (1949):  "Cells that fire together, wire together."

  [This doesn't mean that the cells have to fire at exactly the same time. But if one cell's firing tends to make another cell fire more often, their excitatory synaptic connection tends to grow stronger. There's a reverse rule for inhibitory connections. And there are ways for neurons that aren't even connected to grow connections.]
  [There are simple computer learning algorithms based on Hebb's rule; see the sketch after this list. It can work, but it's generally not nearly as fast or effective as backpropagation.]
  [Backpropagation is one part of artificial neural networks for which there is no analogy in the brain. Brains definitely do not do backpropagation.]

[Show Geoff Hinton's Hebbian learning slides (hebbian.pdf).]

[But this two-layer network is not flexible enough to do digit recognition well, especially when you have multiple writers with different handwriting. You can do much better with a three-layer network and backpropagation.]
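[To make the analogies above concrete, here is a minimal NumPy sketch, not from these notes: one unit computes a sigmoid of a linear combo of its inputs, and a Hebb-style update strengthens connections whose input and output fire together. The normalization step is an extra I've added (raw Hebbian updates grow without bound); the names `unit_output` and `hebbian_step` are mine.]

    import numpy as np

    def sigmoid(gamma):
        # Logistic fn: saturates near 0 and 1, like a firing rate.
        return 1.0 / (1.0 + np.exp(-gamma))

    def unit_output(w, x):
        # One unit: linear combo of inputs ("summation"), then sigmoid
        # ("firing rate saturation").
        return sigmoid(w @ x)

    def hebbian_step(w, x, lr=0.01):
        # Hebb's rule: when input x and this unit's output fire together,
        # strengthen the connection. Renormalize w so it stays bounded.
        y = unit_output(w, x)
        w = w + lr * y * x
        return w / np.linalg.norm(w)

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=4)
    pattern = np.array([1.0, 1.0, 0.0, 0.0])
    for _ in range(100):
        w = hebbian_step(w, pattern)   # w drifts toward the repeated pattern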
[The brain is very modular.]

[Show figure of brain lobes (brain.png).
 - The part of our brain we think of as most characteristically human is the cerebral cortex, the seat of self-awareness, language, and abstract thinking. But the brain has a lot of other parts that take the load off the cortex.
 - Our brain stem regulates functions like heartbeat, breathing, and sleep.
 - Our cerebellum governs fine coordination of motor skills. When we talk about "muscle memory", much of that is in the cerebellum, and it saves us from having to consciously think about how to walk or talk or brush our teeth, so the cortex can focus on where to walk and what to say and checking our phone.
 - Our limbic system is the center of emotion and motivation, and as such, it makes a lot of the big decisions. I sometimes think that 90% of the job of our cerebral cortex is to rationalize decisions that have already been made by the limbic system. "Oh yeah, I made a careful, reasoned, logical decision to eat that fourth pint of ice cream."
 - Our visual cortex performs a lot of processing on the input from your eyes to change it into a more useful form. Neuroscientists and computer scientists are particularly interested in the visual cortex for several reasons. Vision is an important problem for computers. The visual cortex is one of the easier parts of the brain to study in lab animals. And the visual cortex is largely a feedforward network with few neurons going backward, so it's easier for us to train computers to behave like the visual cortex.]

[Although the brain has lots of specialized modules, one thing that's interesting about the cerebral cortex is that it seems to be made of general-purpose neural tissue that looks more or less the same everywhere, at least before it's trained. If you experience damage to part of the cortex early enough in life, while your brain is still growing, the functions will just relocate to a different part of the cortex, and you'll probably never notice the difference.]

[As computer scientists, our primary motivation for studying neurology is to try to get clues about how we can get computers to do tasks that humans are good at. But neurologists and psychologists have also been part of the study of neural nets from the very beginning. Their motivations are scientific: they're curious how humans think, and why we can do what we can do.]


NEURAL NET VARIATIONS
=====================

[I want to show you a few basic variations on the standard neural network I showed you last class, and how some of them change backpropagation.]

Regression:  usually omit sigmoid fn from output unit(s).
[If you make that change, the gradient changes too, and you have to derive it for backprop. The gradient gets simpler, so I'll leave it as an exercise.]

Classification:

- Logistic loss fn (aka _cross-entropy_) often preferred to squared error:

      L(z, y) = - sum_i (y_i ln z_i + (1 - y_i) ln (1 - z_i)),

  where z is the vector of predictions and y is the vector of true values.

- For 2 classes, use one sigmoid output; for k >= 3 classes, use the _softmax_fn_.
  Let t = Wh be the k-vector of linear combos in the final layer.
  The softmax output and its derivatives are

                    e^{t_j}              d z_j                      d z_j
      z_j(t) = ------------------- ,     ----- = z_j (1 - z_j),     ----- = - z_j z_i   (i != j).
               sum_{i=1}^k e^{t_i}       d t_j                      d t_i

  [Each z_j is in the range (0, 1), and their sum is 1.]

[See my hand-drawn derivation of the backprop equations for the softmax output and logistic loss function. A small numerical check of the softmax derivatives follows.]
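[The formulas above are easy to sanity-check numerically. Below is a small NumPy sketch, not part of the notes: `softmax` subtracts max(t) before exponentiating, a standard trick that doesn't change the output but avoids overflow, and the loop compares both derivative formulas against finite differences.]

    import numpy as np

    def softmax(t):
        # Subtracting max(t) cancels in the ratio but prevents exp overflow.
        e = np.exp(t - np.max(t))
        return e / e.sum()

    def logistic_loss(z, y):
        # L(z, y) = -sum_i (y_i ln z_i + (1 - y_i) ln (1 - z_i))
        return -np.sum(y * np.log(z) + (1 - y) * np.log(1 - z))

    # Check d z_j / d t_i against finite differences, one column i at a time.
    t = np.array([1.0, 2.0, 0.5])
    z = softmax(t)
    eps = 1e-6
    for i in range(3):
        tp = t.copy()
        tp[i] += eps
        numeric = (softmax(tp) - z) / eps
        # z_j (1 - z_j) on the diagonal, -z_j z_i off the diagonal:
        analytic = np.where(np.arange(3) == i, z[i] * (1 - z[i]), -z * z[i])
        assert np.allclose(numeric, analytic, atol=1e-4)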
[Next I'm going to talk about a bunch of heuristics that make gradient descent faster, or make it find better local minima, or prevent it from overfitting. I suggest implementing vanilla stochastic backprop first, and experimenting with the other heuristics only after you get that working.]


Unit Saturation
---------------

Problem:  When unit output s is close to 0 or 1 for all training points, s' = s (1 - s) ~ 0, so gradient descent changes s very slowly. Unit is "stuck". Slow training & bad network.

[Show sigmoid function (logistic.pdf); show flat spots & linear region.]
[Wikipedia calls this the "vanishing gradient problem".]
[The more layers your network has, the worse this problem gets. Most of the early attempts to train deep, many-layered neural nets failed.]

Mitigation:  [None of these are complete cures.]

(1) Set target values to 0.15 & 0.85 instead of 0 & 1.
    [Recall that the sigmoid function can never be 0 or 1; it can only come close. Relaxing the target values helps prevent the output units from getting saturated. The numbers 0.15 and 0.85 are reasonable because the sigmoid function achieves its greatest curvature when its output is near 0.21 or 0.79. But experiment to find the best values.]
    [This helps to avoid stuck output units, but not stuck hidden units.]

(2) Modify backprop to add a small constant (typically ~0.1) to s'.
    [This hacks the gradient so a unit can't get stuck. We're not doing *steepest* descent any more, because we're not using the real gradient. But often we're finding a better descent direction that will get us to a minimum faster.]

(3) Initial weight of edge into unit with fan-in eta:  random with mean zero, std. dev. 1/sqrt(eta).
    [The bigger the fan-in of a unit, the easier it is to saturate it. So we choose smaller random initial weights for units with bigger fan-in.]

(4) Replace sigmoids with ReLUs:  _rectified_linear_units_.

    _ramp_fn_ aka _hinge_fn_:  s(gamma) = max { 0, gamma },

                     / 1   gamma >= 0,
        s'(gamma) = <
                     \ 0   gamma < 0.

    [The derivative is not defined at zero, but we can just make one up.]
    [Obviously, the gradient can be zero, so you might wonder whether ReLUs can get stuck too. Fortunately, it's rare for a ReLU's gradient to be zero for *all* the training data; it's usually zero for just some sample points. But yes, ReLUs sometimes get stuck too; just not as often as sigmoids.]

    Popular for many-layer networks with large training sets.
    [One nice thing about ramp functions is that they and their gradients are very fast to compute. Computers compute exponentials slowly.]
    [Even though ReLUs are linear in half their range, they're still nonlinear enough to easily compute functions like XOR.]

[Note that option (4) makes the first three options irrelevant. A short sketch of options (3) and (4) follows below.]
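[A minimal NumPy sketch of mitigations (3) and (4), assuming a fully-connected layer stored as a weight matrix; the function names and the 784-input example layer are my own, not from the notes.]

    import numpy as np

    def relu(gamma):
        # Ramp fn s(gamma) = max{0, gamma}, applied componentwise.
        return np.maximum(0.0, gamma)

    def relu_grad(gamma):
        # s'(gamma): 1 where gamma >= 0, else 0 (the made-up value at 0).
        return (gamma >= 0).astype(float)

    def init_weights(fan_in, fan_out, rng):
        # Mitigation (3): mean zero, std. dev. 1/sqrt(fan_in), so units
        # with bigger fan-in start with smaller weights.
        return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

    rng = np.random.default_rng(0)
    W = init_weights(784, 200, rng)   # hypothetical 784-input, 200-unit layer
    x = rng.random(784)               # a dummy input point
    h = relu(x @ W)                   # hidden-layer outputs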
Heuristics for Avoiding Bad Local Minima
-----------------------------------------

- (1) or (4) above.

- Stochastic gradient descent.  A local minimum for batch descent is not a minimum for one typical training point.
  [The idea is that instead of trying to optimize one risk function, we descend on one example's loss function and then we descend on another example's loss function, and every loss function has different local minima. It looks like a random walk or Brownian motion, and that random noise gets you out of shallow local minima.]
  [Show example of stochastic gradient descent (stochasticnnet.pdf).]

- Momentum.  Gradient descent changes "velocity" slowly. Carries us right through shallow local minima to deeper ones.

      Delta w <- - epsilon grad J(w)
      repeat
          w <- w + Delta w
          Delta w <- - epsilon grad J(w) + beta Delta w

  [Here J(w) denotes the cost fn we're descending on.]

  Good for both stochastic & batch descent. Choose hyperparameter beta < 1.
  [Think of Delta w as the velocity. The hyperparameter beta specifies how strongly momentum persists from iteration to iteration.]
  [I've seen conflicting advice on beta. Some researchers set it to 0.9; some set it close to zero.]
  [If beta is large, epsilon tends to be smaller to compensate, in which case you might want to change the first line so the initial velocity is larger.]
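[A direct transcription of the momentum pseudocode above into NumPy, as a sketch rather than a definitive implementation; `grad_J`, a function returning the gradient of the cost fn, and the toy quadratic cost are my own assumptions for illustration.]

    import numpy as np

    def descend_with_momentum(w, grad_J, epsilon=0.01, beta=0.9, steps=1000):
        # Delta w (here `delta`) is the velocity; beta controls how strongly
        # it persists from iteration to iteration.
        delta = -epsilon * grad_J(w)
        for _ in range(steps):
            w = w + delta
            delta = -epsilon * grad_J(w) + beta * delta
        return w

    # Toy cost J(w) = |w|^2 / 2, whose gradient is just w.
    w = descend_with_momentum(np.array([5.0, -3.0]), grad_J=lambda w: w)
    print(w)   # near the minimum at the origin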