



Learning and Memory in Cognitive Systems

Intelligent Control and Cognitive Systems brings you...
Learning and Memory in Cognitive Systems
Joanna J. Bryson, University of Bath, United Kingdom

Sensing vs Perception
• First week: Sensing – what information comes in.
• This week: Perception – what you think is going on.
• Perception includes expectations.
• These are necessary for disambiguating noisy and impoverished sensory information.

"Expectations": Bayes' Theorem
posterior ∝ likelihood × prior
Given you've seen X, you can figure out whether Y is likely true, based on what you already know about the probability of experiencing: X independently, Y independently, and X when you see Y.

One Application...
• Y – potential action
• X – sensing
• priors = memory
• priors + sense = perception

Expectations
• For all cognitive systems, some priors are hardcoded: body shape, sensing array, even neural connectivity.
• These are derived from the experience of evolution, or from a designer.
• Other expectations are derived from an individual's own experience – learning.

Learning
• Learning requires:
  • a representation,
  • a means of acting on current evidence,
  • a means of incorporating feedback concerning the outcome of the guess.
• AI learning calls incorporating feedback "error correction".

Yann LeCun (NYU): Learning is NOT Memorization
Rote learning is easy: just memorize all the training examples and their corresponding outputs. When a new input comes in, compare it to all the memorized samples, and produce the output associated with the matching sample.
PROBLEM: in general, new inputs are different from training samples.
The ability to produce correct outputs or behavior on previously unseen inputs is called GENERALIZATION. Rote learning is memorization without generalization.
The big question of Learning Theory (and practice): how to get good generalization with a limited number of examples.
Y. LeCun: Machine Learning and Pattern Recognition – p. 10/29
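The "perception = priors + sense" idea above can be sketched numerically with Bayes' rule. This is a minimal illustration with made-up numbers, not anything from the lecture: Y is a hypothesised state of the world, X is a sensor reading, and the prior plays the role of memory.

```python
# Bayes' theorem sketch: posterior ∝ likelihood × prior (all numbers made up).
# Y = "obstacle ahead" (hypothesis), X = "bump sensor fired" (observation).
p_y = 0.1               # prior P(Y): what memory says about obstacles
p_x_given_y = 0.9       # likelihood P(X | Y): sensor fires when obstacle present
p_x_given_not_y = 0.05  # false-positive rate P(X | not Y)

# P(X) by total probability, then Bayes' rule gives the posterior.
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)
posterior = p_x_given_y * p_y / p_x
print(round(posterior, 3))  # P(Y | X): perception combining prior and sense
```

Note how a weak prior (obstacles are rare) tempers a reliable sensor: the posterior is well below the 0.9 likelihood.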
Learning Outcomes
• The objective is to do the right thing at the right time (to be intelligent).
• Doing the right thing often requires predicting likely possible sensory conditions, so you can disambiguate situations that would otherwise be perceptually aliased.

Two Kinds of Supervised Learning (what we'll use as an example today)
• Regression: also known as "curve fitting" or "function approximation". Learn a continuous input-output mapping from a limited number of examples (possibly noisy).
• Classification: outputs are discrete variables (category labels). Learn a decision boundary that separates one class from the other. Generally, a "confidence" is also desired (how sure are we that the input belongs to the chosen category). Includes kernel methods (not covered here).
Y. LeCun: Machine Learning and Pattern Recognition – p. 8/29

Unsupervised Learning (c.f. Lecture 5, "what the brain seems to be doing")
Unsupervised learning comes down to this: if the input looks like the training samples, output a small number; if it doesn't, output a large number. This is a horrendously ill-posed problem in high dimension. To do it right, we must guess/discover the hidden structure of the inputs. Methods differ by their assumptions about the nature of the data.
• A special case: Density Estimation. Find a function f such that f(X) approximates the probability density of X, p(X), as well as possible.
• Clustering: discover "clumps" of points.
• Embedding: discover a low-dimensional manifold or surface near which the data lives.
• Compression/Quantization: discover a function that computes a compact "code" for each input, from which the input can be reconstructed.
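The regression/classification split above can be sketched on the same toy 1-D data: continuous targets call for curve fitting, discrete labels call for a decision boundary. All numbers here are made up for illustration, and the threshold is an assumption, not a learned value.

```python
# Two kinds of supervised learning on toy 1-D data (all numbers made up).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])   # continuous targets -> regression
labels = np.array([-1, -1, -1, 1, 1])     # discrete targets   -> classification

# Regression ("curve fitting"): least-squares line through (x, y).
slope, intercept = np.polyfit(x, y, 1)

# Classification: in 1-D a "decision boundary" is just a threshold on x
# (threshold chosen by hand here, purely for illustration).
threshold = 2.5
predicted = np.where(x > threshold, 1, -1)
print(round(float(slope), 2), (predicted == labels).mean())
```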
Y. LeCun: Machine Learning and Pattern Recognition – p. 9/29

"Regression" via Chris Bishop: Polynomial Curve Fitting
• Representation: just a polynomial equation.

Example Application to Action Selection
• What you sensed → where to drive your motor.

Sum-of-Squares Error Function
• Use data to fix the world model currently held in the representation.

Error Functions
• Based on some parameter w (for "weight" – more on why it's called that later).
• The objective is to minimise the error function.
• Take its derivative with respect to w, and go down (take the second derivative if necessary).
• Linear functions give a nice U-shaped error, so you can tell when you're done: the derivative = 0.

Theory vs Practice
• If we assume the noise in the signal is Normally distributed (with fixed variance), then least squares is equivalent to probabilistic methods (per CM20220).
• Least squares is a lot easier to implement and lighter-weight to run.
• To the extent the assumption doesn't hold, the quality of results degrades – but may still be OK.

Why Representations Matter
• The green line is the model used to generate the data (in combination with noise); the red line is the model learned from observing that data.
• Shown for 0th-, 1st-, 3rd-, and 9th-order polynomials.

Over-fitting
• When your model is too powerful for the data, it just "rote memorises" without generalising.
• That means you get better on training data but worse on data you haven't seen.
• Root-Mean-Square (RMS) error shows this: training error falls while test error rises.
• Spot the indication of a problem: the polynomial coefficients blow up.

More Data Makes It Better
• With a larger data set, the 9th-order polynomial fits better and better: more data is more information about the underlying model.

Overfitting
• If you can memorise everything, then you have no error signal to learn from, so you can't improve your model.
• If you can really memorise everything, this doesn't matter: "Generalisation isn't the point of learning. Being right is the point of learning." – Will Lowe
• But mostly, it matters.

Another Solution (when you can't get more data): Regularization
• Penalize large coefficient values.
• Spot the indication of a problem: compare the polynomial coefficients with regularization vs. without.

How Biology Does It (as per last time)
• The first attempts at machine learning in the '50s, and the development of artificial neural networks in the '80s and '90s, were inspired by biology.
• Nervous systems are networks of neurons interconnected through synapses.
• Learning and memory are changes in the "efficacy" of the synapses.
• HUGE SIMPLIFICATION: a neuron computes a weighted sum of its inputs (where the weights are the synaptic efficacies) and fires when that sum exceeds a threshold.
• Hebbian learning (from Hebb, 1947): synaptic weights change as a function of the pre- and post-synaptic activities.
• Orders of magnitude: each neuron has 10^3 to 10^5 synapses. Brain sizes (number of neurons): housefly 10^5; mouse 5·10^6; human 10^10.
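The over-fitting and regularisation slides can be sketched numerically: a 9th-order least-squares fit to 10 noisy points interpolates them (training error near zero) while its coefficients blow up, and an L2 penalty on the coefficients (ridge regression, one standard way to "penalize large coefficient values") shrinks them. The data, the noise level, and λ are all made-up assumptions for illustration.

```python
# Over-fitting and L2 regularisation sketch (assumed data, assumed lambda).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=10)  # noisy "green line"

Phi = np.vander(x, 10)  # 9th-order polynomial features

# Unregularised least squares: interpolates the 10 points, coefficients blow up.
w_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]

# Ridge: minimise ||Phi w - y||^2 + lam ||w||^2, i.e. (Phi'Phi + lam I) w = Phi'y.
lam = 1e-3
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)

print(np.abs(w_ls).max(), np.abs(w_ridge).max())  # penalty shrinks coefficients
```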
Y. LeCun: Machine Learning and Pattern Recognition – p. 12/29

The Perceptron (image via wikipedia; originally: The Perceptron)

The Linear Classifier
Historically, the Linear Classifier was designed as a highly simplified model of the neuron (McCulloch and Pitts 1943, Rosenblatt 1957):

  y = f( \sum_{i=0}^{N} w_i x_i )

with f the threshold function: f(z) = 1 if z ≥ 0, f(z) = −1 otherwise. x_0 is assumed to be constant, equal to 1, and w_0 is interpreted as a bias. In vector form, with W = (w_0, w_1, ..., w_n) and X = (1, x_1, ..., x_n):

  y = f(W′X)

The hyperplane W′X = 0 partitions the space into two categories. W is orthogonal to the hyperplane.
Y. LeCun: Machine Learning and Pattern Recognition – p. 13/29

A Simple Idea for Learning: Error Correction – the Perceptron Learning Algorithm
We have a training set S consisting of P input-output pairs: S = (X^1, y^1), (X^2, y^2), ..., (X^P, y^P). A very simple algorithm: show each sample in sequence, repetitively:
• if the output is correct: do nothing;
• if the output is −1 and the desired output is +1: increase the weights whose inputs are positive, decrease the weights whose inputs are negative;
• if the output is +1 and the desired output is −1: decrease the weights whose inputs are positive, increase the weights whose inputs are negative.
More formally, for sample p:

  w_i(t+1) = w_i(t) + (y^p − f(W′X^p)) x_i^p

This simple algorithm is called the Perceptron learning procedure (Rosenblatt 1957).
Y. LeCun: Machine Learning and Pattern Recognition – p. 15/29

Historical Note
• Our understanding of linear classifiers and probability-based learning came from our attempts to understand what neural networks (NN) could and couldn't do.
• NNs are intuitive, easy, algorithmic, attractive, biologically inspired.
• But these days, most (not all) of the real action is happening in straight maths.

Common Learning Algorithm Tricks
• How much you add to or subtract from the weight determines how fast you learn: the learning rate.
• If you learn too fast you can overshoot the ideal value; do this a lot and you dither forever.
• We want learning to converge on the right values.
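The Perceptron learning procedure above translates almost line-for-line into code. This is a minimal sketch on a toy problem of my choosing (logical AND, which is linearly separable); the epoch count is an arbitrary assumption.

```python
# Rosenblatt's perceptron learning rule: w(t+1) = w(t) + (y_p - f(W'X_p)) X_p,
# with f a +/-1 threshold. Toy data: logical AND, which is linearly separable.
import numpy as np

def train_perceptron(X, y, epochs=20):
    X = np.hstack([np.ones((len(X), 1)), X])  # x_0 = 1 supplies the bias w_0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xp, yp in zip(X, y):
            out = 1 if w @ xp >= 0 else -1    # f(W'X)
            w += (yp - out) * xp              # 0 if correct, +/-2*x otherwise
    return w

def predict(w, X):
    X = np.hstack([np.ones((len(X), 1)), X])
    return np.where(X @ w >= 0, 1, -1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])  # AND
w = train_perceptron(X, y)
print(predict(w, X))  # -> [-1 -1 -1  1]
```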
The Perceptron Learning Procedure (provably works iff linearly separable)
Theorem: If the classes are linearly separable (i.e. separable by a hyperplane), then the Perceptron procedure will converge to a solution in a finite number of steps.
Proof: Denote by W* a normalized vector in the direction of a solution. Suppose all X^p are within a ball of radius R. Without loss of generality, we replace all X^p whose y^p is −1 by −X^p, and set all y^p to 1. Now define the margin M = min_p W*·X^p. Each time there is an error, W·W* increases by at least X^p·W* ≥ M. This means W_final·W* ≥ N M, where N is the total number of weight updates (the total number of errors). But the change in the squared magnitude of W is bounded by the squared magnitude of the current sample X^p, which is itself bounded by R². Therefore ‖W_final‖² ≤ N R². Combining the two inequalities W_final·W* ≥ N M and ‖W_final‖ ≤ √N R, we have

  W_final·W* / ‖W_final‖ ≥ √N M / R.

Since the left-hand side is upper-bounded by 1, we deduce N ≤ R²/M².
(Proof by Minsky – long story.)
Y. LeCun: Machine Learning and Pattern Recognition – p. 16/29

Neat vs Scruffy
• How can you be sure your problem is linearly separable? You can't.
• Just try it anyway: Scruffy.
• Only use provably cool stuff: Neat.

Neats + Scruffies
• A collection of hacks is more likely to win if it is motivated by theory – if each hack is a reasonable approximation of what a sound system would do.
• A systems approach will look for indicators of fail states for scruffy solutions (e.g. catching coefficients blowing up early).

A Simple Trick: Nearest Neighbor Matching
Instead of insisting that the input be exactly identical to one of the training samples, let's compute the "distances" between the input and all the memorized samples (a.k.a. the prototypes).
• 1-Nearest-Neighbor Rule: pick the class of the nearest prototype.
• K-Nearest-Neighbor Rule: pick the class that has the majority among the K nearest prototypes.
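The two nearest-neighbor rules can be sketched in a few lines. Euclidean distance and the toy prototypes are assumptions for illustration; choosing the right distance measure is itself an open problem, as the lecture notes.

```python
# 1-NN / k-NN sketch with plain Euclidean distance (toy prototypes made up).
import numpy as np

def knn_predict(prototypes, labels, x, k=1):
    """Pick the majority class among the k nearest memorized prototypes."""
    dists = np.linalg.norm(prototypes - x, axis=1)  # distance to every prototype
    nearest = np.argsort(dists)[:k]                 # indices of the k closest
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[np.argmax(counts)]                # majority vote

prototypes = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
labels = np.array([0, 0, 1, 1])
print(knn_predict(prototypes, labels, np.array([0.2, 0.1]), k=3))  # -> 0
```

Note the cost: every query scans all prototypes, which is exactly the expense problem raised next.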
• PROBLEM: what is the right distance measure?
• PROBLEM: this is horrendously expensive if the number of prototypes is large.
• PROBLEM: do we have any guarantee that we get the best possible performance as the number of training samples increases?
Problem, problem, problem – but it works really well. It can often also interpolate between stored solutions (Atkeson and Schaal).
Y. LeCun: Machine Learning and Pattern Recognition – p. 11/29

Single Layer Perceptron Network
• Note: mutual inhibition gives "winner take all" (WTA).

Neats vs Scruffies: Multi-layer Perceptrons
• The hope: NNs that "learned like people" would solve AI.
• Minsky & Papert (1969) proved single-layered perceptron networks can't solve some pretty basic problems.
• No one knew how to train multi-layer perceptrons; funding dried up; the field almost died: the AI Winter.

Multi-Layered Perceptron
• Would solve the problem.
• But if there's an error, which weight caused it?

Neats vs Scruffies: Backpropagation
• In the 1980s, several people realised that if the threshold was a sigmoid rather than a step function, you could assign "credit" across layers using calculus – backpropagation. (Backpropagation is essentially the chain rule.)
• But then they realised they could do lots of things with calculus and statistics – serious machine learning academics do Bayes now.

Backpropagation: Geoff Hinton
• One of the (independent) backprop inventors.
• cf. deep learning, Boltzmann Machines.

Neats vs Scruffies: Theory vs Practice
• Serious, fast, applied outfits (e.g. Google) do the serious neat stuff (though sometimes scruffily hacked together).
• But there are many, many applications of backpropagation on 3-layer networks in ordinary industry, by students like you.
• 2013: "NNs are still used by psychologists and some artificial life researchers."

Other Topical NN Research (also 2013)
• Compartmental models.
• Spike-timing networks.
2014: Deep Mind. See also lecture notes…
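The "sigmoid instead of step function" point above can be made concrete: with a differentiable activation, the chain rule assigns error credit to every weight in every layer. This is a minimal hand-rolled sketch (no library, sizes and learning rate are arbitrary assumptions), trained on XOR, the kind of problem a single-layer perceptron provably cannot solve.

```python
# Backprop as the chain rule: a tiny 2-input, 4-hidden, 1-output sigmoid net
# trained by full-batch gradient descent on XOR (all hyperparameters assumed).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0.0], [1.0], [1.0], [0.0]])  # XOR: not linearly separable

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)
lr = 0.5

losses = []
for _ in range(2000):
    h = sigmoid(X @ W1 + b1)            # forward pass, hidden layer
    y = sigmoid(h @ W2 + b2)            # forward pass, output
    losses.append(np.mean((y - t) ** 2))
    # Backward pass: chain rule propagates "credit" for the error to each layer.
    dy = (y - t) * y * (1 - y)          # gradient at output (sigmoid derivative)
    dh = (dy @ W2.T) * h * (1 - h)      # gradient pushed back through hidden layer
    W2 -= lr * h.T @ dy; b2 -= lr * dy.sum(axis=0)
    W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(axis=0)

print(losses[0], losses[-1])  # error shrinks as credit is assigned across layers
```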