NEURONS
=======

[The field of artificial intelligence started with some wrong premises. The early AI researchers attacked problems like chess and theorem proving, because they thought those exemplified the essence of intelligence. They didn't pay much attention at first to problems like vision and speech understanding. Any four-year-old can do those things, and so researchers underestimated their difficulty. Today, we know better. Computers can effortlessly beat four-year-olds at chess, but they still can't play with toys nearly as well. We've come to realize that rule-based symbol manipulation is not the sole defining mark of intelligence. Even rats do computations that we're hard pressed to match with our computers. We've also come to realize that these are different classes of problems that require very different styles of computation. Brains and computers have very different strengths and weaknesses, which reflect these different computing styles.]

[Neural networks are partly inspired by the workings of actual brains. Let's take a look at a few things we know about biological neurons, and contrast them with both neural nets and traditional computation.]

- CPUs:   largely sequential, nanosecond gates, fragile if gate fails,
          superior for 234 x 718, logical rules, perfect key-based memory

- Brains: very parallel, millisecond neurons, fault-tolerant,

  [Neurons are continually dying. You've probably lost a few since this lecture started. But you probably didn't notice. And that's interesting, because it points out that our memories are stored in our brains in a diffuse representation. There is no one neuron whose death will make you forget that 2 + 2 = 4. Artificial neural nets often share that resilience. Brains and neural nets seem to superpose memories on top of each other, all stored together in the same weights, sort of like a hologram.]

          superior for vision, speech, associative memory

  [By "associative memory", I mean noticing connections between things. One thing our brains are very good at is retrieving a pattern if we specify only a portion of the pattern.]

[It's impressive that even though a neuron needs a few milliseconds to transmit information to the next neurons downstream, we can perform very complex tasks like interpreting a visual scene in a tenth of a second. This is possible because neurons are parallel, but also because of their computation style.]

[Neural nets try to emulate the parallel, associative thinking style of brains, and they are among the best techniques we have for many fuzzy problems, including some problems in vision and speech. Not coincidentally, neural nets are also inferior at many traditional computer tasks such as multiplying numbers with lots of digits or compiling source code.]

[Show figure of neurons (neurons.pdf).]

- _Neuron_:  A cell in brain/nervous system for thinking/communication
- _Action_potential_ or _spike_:  An electrochemical impulse _fired_ by a neuron to communicate w/other neurons
- _Axon_:  The limb(s) along which the action potential propagates; "output"
  [Most axons branch out eventually, sometimes profusely near their ends.]
  [It turns out that giant squids have a very large axon they use for fast water jet propulsion. The mathematics of action potentials was first characterized in these giant squid axons, and that work won a Nobel Prize in 1963.]
- _Dendrite_:  Smaller limbs by which neuron receives info; "input"
- _Synapse_:  Connection from one neuron's axon to another's dendrite
  [Some synapses connect axons to muscles or glands.]
- _Neurotransmitter_:  Chemical released by axon terminal to stimulate dendrite
  [When an action potential reaches an axon terminal, it causes tiny containers of neurotransmitter, called _vesicles_, to empty their contents into the space where the axon terminal meets another neuron's dendrite. That space is called the _synaptic_cleft_. The neurotransmitters bind to receptors on the dendrite and influence the next neuron's body voltage. This sounds incredibly slow, but it all happens in 1 to 5 milliseconds.]

You have about 10^11 neurons, each with about 10^4 synapses. [Maybe 10^5 synapses after you pass CS 189.]

Analogies:  [between artificial neural networks and brains]

- Output of unit <-> firing rate of neuron
  [An action potential is "all or nothing"; they all have the same shape and size. The output of a neuron is not a fixed voltage like the output of a transistor. The output of a neuron is the frequency at which it fires. Some neurons can fire at nearly 1,000 times a second, which you might think of as a strong "1" output. Conversely, some types of neurons can go for minutes without firing. But some types of neurons never stop firing, and for those you might interpret a firing rate of 10 times per second as "0".]
- Weight of connection <-> synapse strength
- Positive weight <-> excitatory neurotransmitter (e.g. glutamate)
- Negative weight <-> inhibitory neurotransmitter (e.g. GABA, glycine)
  [GABA is short for gamma-aminobutyric acid.]
  [A typical neuron is either excitatory at all its axon terminals, or inhibitory at all its terminals. It can't switch from one to the other. Artificial neural nets have an advantage here.]
- Linear combo of inputs <-> _summation_
  [A neuron fires when the sum of its inputs, integrated over time, reaches a high enough voltage. However, the neuron body voltage also decays slowly with time, so if the action potentials coming in are slow enough, the neuron might not fire at all.]
- Logistic/sigmoid fn <-> firing rate saturation
  [A neuron can't fire more than 1,000 times a second, nor less than zero times a second. This limits its ability to be the sole determinant of whether downstream neurons fire. We accomplish the same thing with the sigmoid fn.]
- Weight change/learning <-> _synaptic_plasticity_

  Hebb's rule (1949):  "Cells that fire together, wire together."

  [This doesn't mean that the cells have to fire at exactly the same time. But if one cell's firing tends to make another cell fire more often, their excitatory synaptic connection tends to grow stronger. There's a reverse rule for inhibitory connections. And there are ways for neurons that aren't even connected to grow connections.]
  [There are simple computer learning algorithms based on Hebb's rule; see the sketch after this list. It can work, but it's generally not nearly as fast or effective as backpropagation.]
  [Backpropagation is one part of artificial neural networks for which there is no analogy in the brain. Brains definitely do not do backpropagation.]

[Show Geoff Hinton's Hebbian learning slides (hebbian.pdf).]

[But this two-layer network is not flexible enough to do digit recognition well, especially when you have multiple writers with different handwriting. You can do much better with a three-layer network and backpropagation.]
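[To make the analogies above concrete, here is a minimal NumPy sketch, not from these notes: one unit computes a sigmoid of a linear combo of its inputs, and a Hebb-style update strengthens connections whose input and output fire together. The normalization step is an extra I've added (raw Hebbian updates grow without bound); the names `unit_output` and `hebbian_step` are mine.]

    import numpy as np

    def sigmoid(gamma):
        # Logistic fn: saturates near 0 and 1, like a firing rate.
        return 1.0 / (1.0 + np.exp(-gamma))

    def unit_output(w, x):
        # One unit: linear combo of inputs ("summation"), then sigmoid
        # ("firing rate saturation").
        return sigmoid(w @ x)

    def hebbian_step(w, x, lr=0.01):
        # Hebb's rule: when input x and this unit's output fire together,
        # strengthen the connection. Renormalize w so it stays bounded.
        y = unit_output(w, x)
        w = w + lr * y * x
        return w / np.linalg.norm(w)

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=4)
    pattern = np.array([1.0, 1.0, 0.0, 0.0])
    for _ in range(100):
        w = hebbian_step(w, pattern)   # w drifts toward the repeated pattern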
[The brain is very modular.]

[Show figure of brain lobes (brain.png).
 - The part of our brain we think of as most characteristically human is the cerebral cortex, the seat of self-awareness, language, and abstract thinking. But the brain has a lot of other parts that take the load off the cortex.
 - Our brain stem regulates functions like heartbeat, breathing, and sleep.
 - Our cerebellum governs fine coordination of motor skills. When we talk about "muscle memory", much of that is in the cerebellum, and it saves us from having to consciously think about how to walk or talk or brush our teeth, so the cortex can focus on where to walk and what to say and checking our phone.
 - Our limbic system is the center of emotion and motivation, and as such, it makes a lot of the big decisions. I sometimes think that 90% of the job of our cerebral cortex is to rationalize decisions that have already been made by the limbic system. "Oh yeah, I made a careful, reasoned, logical decision to eat that fourth pint of ice cream."
 - Our visual cortex performs a lot of processing on the input from your eyes to change it into a more useful form. Neuroscientists and computer scientists are particularly interested in the visual cortex for several reasons. Vision is an important problem for computers. The visual cortex is one of the easier parts of the brain to study in lab animals. And the visual cortex is largely a feedforward network with few neurons going backward, so it's easier for us to train computers to behave like the visual cortex.]

[Although the brain has lots of specialized modules, one thing that's interesting about the cerebral cortex is that it seems to be made of general-purpose neural tissue that looks more or less the same everywhere, at least before it's trained. If you experience damage to part of the cortex early enough in life, while your brain is still growing, the functions will just relocate to a different part of the cortex, and you'll probably never notice the difference.]

[As computer scientists, our primary motivation for studying neurology is to try to get clues about how we can get computers to do tasks that humans are good at. But neurologists and psychologists have also been part of the study of neural nets from the very beginning. Their motivations are scientific: they're curious how humans think, and why we can do what we can do.]


NEURAL NET VARIATIONS
=====================

[I want to show you a few basic variations on the standard neural network I showed you last class, and how some of them change backpropagation.]

Regression:  usually omit sigmoid fn from output unit(s).
[If you make that change, the gradient changes too, and you have to derive it for backprop. The gradient gets simpler, so I'll leave it as an exercise.]

Classification:

- Logistic loss fn (aka _cross-entropy_) often preferred to squared error:

      L(z, y) = - sum_i (y_i ln z_i + (1 - y_i) ln (1 - z_i)),

  where z is the vector of predictions and y is the vector of true values.

- For 2 classes, use one sigmoid output; for k >= 3 classes, use the _softmax_fn_.
  Let t = Wh be the k-vector of linear combos in the final layer.
  The softmax output and its derivatives are

                    e^{t_j}              d z_j                      d z_j
      z_j(t) = ------------------- ,     ----- = z_j (1 - z_j),     ----- = - z_j z_i   (i != j).
               sum_{i=1}^k e^{t_i}       d t_j                      d t_i

  [Each z_j is in the range (0, 1), and their sum is 1.]

[See my hand-drawn derivation of the backprop equations for the softmax output and logistic loss function. A small numerical check of the softmax derivatives follows.]
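[The formulas above are easy to sanity-check numerically. Below is a small NumPy sketch, not part of the notes: `softmax` subtracts max(t) before exponentiating, a standard trick that doesn't change the output but avoids overflow, and the loop compares both derivative formulas against finite differences.]

    import numpy as np

    def softmax(t):
        # Subtracting max(t) cancels in the ratio but prevents exp overflow.
        e = np.exp(t - np.max(t))
        return e / e.sum()

    def logistic_loss(z, y):
        # L(z, y) = -sum_i (y_i ln z_i + (1 - y_i) ln (1 - z_i))
        return -np.sum(y * np.log(z) + (1 - y) * np.log(1 - z))

    # Check d z_j / d t_i against finite differences, one column i at a time.
    t = np.array([1.0, 2.0, 0.5])
    z = softmax(t)
    eps = 1e-6
    for i in range(3):
        tp = t.copy()
        tp[i] += eps
        numeric = (softmax(tp) - z) / eps
        # z_j (1 - z_j) on the diagonal, -z_j z_i off the diagonal:
        analytic = np.where(np.arange(3) == i, z[i] * (1 - z[i]), -z * z[i])
        assert np.allclose(numeric, analytic, atol=1e-4)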
[Next I'm going to talk about a bunch of heuristics that make gradient descent faster, or make it find better local minima, or prevent it from overfitting. I suggest implementing vanilla stochastic backprop first, and experimenting with the other heuristics only after you get that working.]


Unit Saturation
---------------

Problem:  When unit output s is close to 0 or 1 for all training points, s' = s (1 - s) ~ 0, so gradient descent changes s very slowly. Unit is "stuck". Slow training & bad network.

[Show sigmoid function (logistic.pdf); show flat spots & linear region.]
[Wikipedia calls this the "vanishing gradient problem".]
[The more layers your network has, the worse this problem gets. Most of the early attempts to train deep, many-layered neural nets failed.]

Mitigation:  [None of these are complete cures.]

(1) Set target values to 0.15 & 0.85 instead of 0 & 1.
    [Recall that the sigmoid function can never be 0 or 1; it can only come close. Relaxing the target values helps prevent the output units from getting saturated. The numbers 0.15 and 0.85 are reasonable because the sigmoid function achieves its greatest curvature when its output is near 0.21 or 0.79. But experiment to find the best values.]
    [This helps to avoid stuck output units, but not stuck hidden units.]

(2) Modify backprop to add a small constant (typically ~0.1) to s'.
    [This hacks the gradient so a unit can't get stuck. We're not doing *steepest* descent any more, because we're not using the real gradient. But often we're finding a better descent direction that will get us to a minimum faster.]

(3) Initial weight of edge into unit with fan-in eta:  random with mean zero, std. dev. 1/sqrt(eta).
    [The bigger the fan-in of a unit, the easier it is to saturate it. So we choose smaller random initial weights for units with bigger fan-in.]

(4) Replace sigmoids with ReLUs:  _rectified_linear_units_.

    _ramp_fn_ aka _hinge_fn_:  s(gamma) = max { 0, gamma },

                     / 1   gamma >= 0,
        s'(gamma) = <
                     \ 0   gamma < 0.

    [The derivative is not defined at zero, but we can just make one up.]
    [Obviously, the gradient can be zero, so you might wonder whether ReLUs can get stuck too. Fortunately, it's rare for a ReLU's gradient to be zero for *all* the training data; it's usually zero for just some sample points. But yes, ReLUs sometimes get stuck too; just not as often as sigmoids.]

    Popular for many-layer networks with large training sets.
    [One nice thing about ramp functions is that they and their gradients are very fast to compute. Computers compute exponentials slowly.]
    [Even though ReLUs are linear in half their range, they're still nonlinear enough to easily compute functions like XOR.]

[Note that option (4) makes the first three options irrelevant. A short sketch of options (3) and (4) follows below.]
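[A minimal NumPy sketch of mitigations (3) and (4), assuming a fully-connected layer stored as a weight matrix; the function names and the 784-input example layer are my own, not from the notes.]

    import numpy as np

    def relu(gamma):
        # Ramp fn s(gamma) = max{0, gamma}, applied componentwise.
        return np.maximum(0.0, gamma)

    def relu_grad(gamma):
        # s'(gamma): 1 where gamma >= 0, else 0 (the made-up value at 0).
        return (gamma >= 0).astype(float)

    def init_weights(fan_in, fan_out, rng):
        # Mitigation (3): mean zero, std. dev. 1/sqrt(fan_in), so units
        # with bigger fan-in start with smaller weights.
        return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

    rng = np.random.default_rng(0)
    W = init_weights(784, 200, rng)   # hypothetical 784-input, 200-unit layer
    x = rng.random(784)               # a dummy input point
    h = relu(x @ W)                   # hidden-layer outputs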
Heuristics for Avoiding Bad Local Minima
-----------------------------------------

- (1) or (4) above.

- Stochastic gradient descent.  A local minimum for batch descent is not a minimum for one typical training point.
  [The idea is that instead of trying to optimize one risk function, we descend on one example's loss function and then we descend on another example's loss function, and every loss function has different local minima. It looks like a random walk or Brownian motion, and that random noise gets you out of shallow local minima.]
  [Show example of stochastic gradient descent (stochasticnnet.pdf).]

- Momentum.  Gradient descent changes "velocity" slowly. Carries us right through shallow local minima to deeper ones.

      Delta w <- - epsilon grad J(w)
      repeat
          w <- w + Delta w
          Delta w <- - epsilon grad J(w) + beta Delta w

  [Here J(w) denotes the cost fn we're descending on.]

  Good for both stochastic & batch descent. Choose hyperparameter beta < 1.
  [Think of Delta w as the velocity. The hyperparameter beta specifies how strongly momentum persists from iteration to iteration.]
  [I've seen conflicting advice on beta. Some researchers set it to 0.9; some set it close to zero.]
  [If beta is large, epsilon tends to be smaller to compensate, in which case you might want to change the first line so the initial velocity is larger.]
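[A direct transcription of the momentum pseudocode above into NumPy, as a sketch rather than a definitive implementation; `grad_J`, a function returning the gradient of the cost fn, and the toy quadratic cost are my own assumptions for illustration.]

    import numpy as np

    def descend_with_momentum(w, grad_J, epsilon=0.01, beta=0.9, steps=1000):
        # Delta w (here `delta`) is the velocity; beta controls how strongly
        # it persists from iteration to iteration.
        delta = -epsilon * grad_J(w)
        for _ in range(steps):
            w = w + delta
            delta = -epsilon * grad_J(w) + beta * delta
        return w

    # Toy cost J(w) = |w|^2 / 2, whose gradient is just w.
    w = descend_with_momentum(np.array([5.0, -3.0]), grad_J=lambda w: w)
    print(w)   # near the minimum at the origin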