



Multi-Layer Artificial Neural Networks (ANNs)

Multi-Layer Networks Built from Perceptron Units

• Perceptrons are not able to learn certain concepts
  – They can only learn linearly separable functions
• But they can be the basis for larger structures
  – Which can learn more sophisticated concepts
  – We say that these networks have "perceptron units"

Problem with Perceptron Units

• The learning rule relies on differential calculus
  – Finding minima by differentiating, etc.
• Step functions aren't differentiable
  – They are not continuous at the threshold
• An alternative threshold function is sought
  – Must be differentiable
  – Must be similar to the step function
    - i.e., exhibit a threshold so that units can "fire" or not fire
• Sigmoid units are used for backpropagation
  – There are other alternatives that are also often used

Sigmoid Units

• Take in a weighted sum of the inputs, S, and output:
  – σ(S) = 1/(1 + e^(−S))
• Advantages:
  – Looks very similar to the step function
  – Is differentiable
  – Its derivative is easily expressible in terms of σ itself:
    - σ'(S) = σ(S) × (1 − σ(S))

Example ANN with Sigmoid Units

• Feed-forward network
  – Feed inputs in on the left, propagate numbers forward
• Suppose we have this ANN, with weights set arbitrarily:
  – Input to hidden: I1→H1 = 0.2, I2→H1 = −0.1, I3→H1 = 0.4; I1→H2 = 0.7, I2→H2 = −1.2, I3→H2 = 1.2
  – Hidden to output: H1→O1 = 1.1, H2→O1 = 0.1; H1→O2 = 3.1, H2→O2 = 1.17

Propagation of Example

• Suppose the input to the ANN is (10, 30, 20)
• First calculate the weighted sums into the hidden layer:
  – S_H1 = (0.2 × 10) + (−0.1 × 30) + (0.4 × 20) = 2 − 3 + 8 = 7
  – S_H2 = (0.7 × 10) + (−1.2 × 30) + (1.2 × 20) = 7 − 36 + 24 = −5
• Next calculate the output from the hidden layer:
  – Using: σ(S) = 1/(1 + e^(−S))
  – σ(S_H1) = 1/(1 + e^(−7)) = 1/(1 + 0.000912) = 0.999
  – σ(S_H2) = 1/(1 + e^(5)) = 1/(1 + 148.4) = 0.0067
  – So, H1 has fired, H2 has not
• Next calculate the weighted sums into the output layer:
  – S_O1 = (1.1 × 0.999) + (0.1 × 0.0067) = 1.0996
  – S_O2 = (3.1 × 0.999) + (1.17 × 0.0067) = 3.1047
• Finally, calculate the output from the ANN:
  – σ(S_O1) = 1/(1 + e^(−1.0996)) = 1/(1 + 0.333) = 0.750
  – σ(S_O2) = 1/(1 + e^(−3.1047)) = 1/(1 + 0.045) = 0.957
• The output from O2 is greater than the output from O1
  – So, the ANN predicts the category associated with O2
  – For the example input (10, 30, 20)
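The propagation just traced can be reproduced in a few lines of code. The following is a minimal Python sketch, not part of the original slides: the weight tables are read off the worked example above, and the names sigmoid, propagate and the two weight dictionaries are purely illustrative.

import math

def sigmoid(s):
    # sigma(S) = 1 / (1 + e^(-S))
    return 1.0 / (1.0 + math.exp(-s))

# Weights read off the example network above
w_input_to_hidden = {
    "H1": [0.2, -0.1, 0.4],
    "H2": [0.7, -1.2, 1.2],
}
w_hidden_to_output = {
    "O1": {"H1": 1.1, "H2": 0.1},
    "O2": {"H1": 3.1, "H2": 1.17},
}

def propagate(example):
    # Weighted sums into the hidden layer, then the sigmoid of each
    hidden = {}
    for h, weights in w_input_to_hidden.items():
        s = sum(w * x for w, x in zip(weights, example))
        hidden[h] = sigmoid(s)
    # Weighted sums into the output layer, then the sigmoid of each
    output = {}
    for o, weights in w_hidden_to_output.items():
        s = sum(weights[h] * hidden[h] for h in hidden)
        output[o] = sigmoid(s)
    return hidden, output

hidden, output = propagate([10, 30, 20])
print(hidden)   # {'H1': 0.999..., 'H2': 0.0067...}
print(output)   # {'O1': 0.750..., 'O2': 0.957...}  -> predict O2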
Backpropagation Learning Algorithm

• Same task as with perceptrons
  – Learn a multi-layer ANN to correctly categorise unseen examples
  – We'll concentrate on ANNs with one hidden layer
• Overview of the routine
  – Fix the architecture and the sigmoid units within it
    - i.e., the number of units in the hidden layer; the way the input units represent an example; the way the output units categorise examples
  – Randomly assign weights to the whole network
    - Use small values (between −0.5 and 0.5)
  – Use each example in the training set to retrain the weights
  – Have multiple epochs (iterations through the training set)
    - Until some termination condition is met (not necessarily 100% accuracy)

Weight Training Calculations (Overview)

• Use the notation w_ij to specify:
  – The weight between unit i and unit j
• Look at the calculation with respect to a single example E
• We are going to calculate a value Δ_ij for each w_ij
  – And add Δ_ij on to w_ij
• Do this by calculating error terms for each unit
• The error terms for the output units are found first
  – And then this information is used to calculate the error terms for the hidden units
• So, the error is propagated back through the ANN

Propagate E through the Network

• Feed E through the network (as in the example above)
• Record the target and observed values for example E
  – i.e., determine the weighted sums from the hidden units, do the sigmoid calculation
  – Let t_i(E) be the target value for output unit i
  – Let o_i(E) be the observed value for output unit i
• Note that for categorisation learning tasks,
  – Each t_i(E) will be 0, except for a single t_j(E), which will be 1
  – But o_i(E) will be a real-valued number between 0 and 1
• Also record the outputs from the hidden units
  – Let h_i(E) be the output from hidden unit i

Error Terms for Each Unit

• The error term for output unit k is calculated as:
  – δ_Ok = o_k(E) × (1 − o_k(E)) × (t_k(E) − o_k(E))
• The error term for hidden unit k is:
  – δ_Hk = h_k(E) × (1 − h_k(E)) × Σ_j (w_kj × δ_Oj)
• In English:
  – For hidden unit k, add together all the error terms for the output units, each multiplied by the appropriate weight.
  – Then multiply this sum by h_k(E) × (1 − h_k(E))

Final Calculations

• Choose a learning rate, η (= 0.1 again, perhaps)
• For each weight w_ij
  – Between input unit i and hidden unit j
  – Calculate: Δ_ij = η × δ_Hj × x_i
  – Where x_i is the input to input unit i for E
• For each weight w_ij
  – Between hidden unit i and output unit j
  – Calculate: Δ_ij = η × δ_Oj × h_i(E)
  – Where h_i(E) is the output from hidden unit i for E
• Finally, add each Δ_ij on to the corresponding w_ij

Worked Backpropagation Example

• Start with the previous ANN
• We will retrain the weights
  – In the light of the example E = (10, 30, 20)
  – Stipulate that E should have been categorised as O1
  – Will use a learning rate of η = 0.1

Previous Calculations

• We need the values from when we propagated E through the ANN:
  – t_1(E) = 1 and t_2(E) = 0 from the categorisation
  – o_1(E) = 0.750 and o_2(E) = 0.957
  – h_1(E) = 0.999 and h_2(E) = 0.0067

Error Values for Output Units

• t_1(E) = 1 and t_2(E) = 0 from the categorisation
• o_1(E) = 0.750 and o_2(E) = 0.957
• So:
  – δ_O1 = o_1(E) × (1 − o_1(E)) × (t_1(E) − o_1(E)) = 0.750 × 0.250 × 0.250 = 0.0469
  – δ_O2 = o_2(E) × (1 − o_2(E)) × (t_2(E) − o_2(E)) = 0.957 × 0.043 × (−0.957) = −0.0394

Error Values for Hidden Units

• δ_O1 = 0.0469 and δ_O2 = −0.0394
• h_1(E) = 0.999 and h_2(E) = 0.0067
• So, for H1, we add together:
  – (w_11 × δ_O1) + (w_12 × δ_O2) = (1.1 × 0.0469) + (3.1 × −0.0394) = −0.0706
  – And multiply by h_1(E) × (1 − h_1(E)) to give us:
    - −0.0706 × (0.999 × (1 − 0.999)) = −0.0000705 = δ_H1
• For H2, we add together:
  – (w_21 × δ_O1) + (w_22 × δ_O2) = (0.1 × 0.0469) + (1.17 × −0.0394) = −0.0414
  – And multiply by h_2(E) × (1 − h_2(E)) to give us:
    - −0.0414 × (0.0067 × (1 − 0.0067)) = −0.000276 = δ_H2

Calculation of Weight Changes

• For the weights between the input and hidden layers, use Δ_ij = η × δ_Hj × x_i:
  – Into H1: 0.1 × (−0.0000705) × 10 = −0.0000705; × 30 = −0.000212; × 20 = −0.000141
  – Into H2: 0.1 × (−0.000276) × 10 = −0.000276; × 30 = −0.000828; × 20 = −0.000552
• For the weights between the hidden and output layers, use Δ_ij = η × δ_Oj × h_i(E):
  – Into O1: 0.1 × 0.0469 × 0.999 = 0.00469 and 0.1 × 0.0469 × 0.0067 = 0.0000314
  – Into O2: 0.1 × (−0.0394) × 0.999 = −0.00394 and 0.1 × (−0.0394) × 0.0067 = −0.0000264
• The weight changes are not very large
  – Small differences in weights can make big differences in the calculations
  – But it might be a good idea to increase η
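The whole update can be checked with a short piece of code. This is a minimal Python sketch, not part of the original slides; it reuses the propagate function and the two weight tables from the earlier sketch and performs one backpropagation update for E = (10, 30, 20), target category O1, learning rate η = 0.1.

# One backpropagation update for E = (10, 30, 20), target category O1.
eta = 0.1
example = [10, 30, 20]
target = {"O1": 1.0, "O2": 0.0}    # t_i(E): 1 for the correct category, 0 otherwise

hidden, output = propagate(example)

# Error terms for the output units: delta_Ok = o_k(1 - o_k)(t_k - o_k)
delta_out = {o: output[o] * (1 - output[o]) * (target[o] - output[o])
             for o in output}

# Error terms for the hidden units:
# delta_Hk = h_k(1 - h_k) * sum_j (w_kj * delta_Oj)
delta_hidden = {}
for h in hidden:
    back_sum = sum(w_hidden_to_output[o][h] * delta_out[o] for o in delta_out)
    delta_hidden[h] = hidden[h] * (1 - hidden[h]) * back_sum

# Weight changes, added straight on to the weights
for h, weights in w_input_to_hidden.items():      # input -> hidden
    for i, x in enumerate(example):
        weights[i] += eta * delta_hidden[h] * x
for o, weights in w_hidden_to_output.items():     # hidden -> output
    for h in hidden:
        weights[h] += eta * delta_out[o] * hidden[h]

print(delta_out)      # {'O1': ~0.047, 'O2': ~-0.039}
print(delta_hidden)   # {'H1': ~-0.00007, 'H2': ~-0.0003}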
Calculation of Network Error

• We could calculate the network error as
  – The proportion of miscategorised examples
• But there are multiple output units, with numerical outputs
  – So we use a more sophisticated measure:
  – Error = Σ over examples E of Σ over output units k of (t_k(E) − o_k(E))²
• Not as complicated as it looks
  – Square the difference between target and observed values
    - Squaring ensures we get a positive number
  – Add up all the squared differences
    - For every output unit and every example in the training set

Problems with Local Minima

• Backpropagation is a gradient descent search
  – Where the height of the hills is determined by the error
  – But there are many dimensions to the space
    - One for each weight in the network
• Therefore backpropagation
  – Can find its way into local minima
• One partial solution:
  – Random restart: learn lots of networks
    - Starting from different random weight settings
  – Can take the best network
  – Or can set up a "committee" of networks to categorise examples
• Another partial solution: momentum

Adding Momentum

• Imagine rolling a ball down a hill
  – [Figure: without momentum, the ball gets stuck in a small dip part-way down; with momentum, it rolls on through]

Momentum in Backpropagation

• For each weight
  – Remember what was added to it in the previous epoch
• In the current epoch
  – Add on a small amount of the previous Δ
• The amount is determined by
  – The momentum parameter, denoted α
  – α is taken to be between 0 and 1

How Momentum Works

• If the direction of the weight change doesn't change
  – Then the movement of the search gets bigger
  – The additional amount is compounded in each epoch
  – This may mean that narrow local minima are avoided
  – It may also mean that the convergence rate speeds up
• Caution:
  – The search may not have enough momentum to get out of a local minimum
  – Also, too much momentum might carry the search
    - Back out of the global minimum, and into a local minimum

Problems with Overfitting

• Plot the training example error against the test example error:
  – [Figure: over the epochs, training error keeps decreasing while test error starts to rise]
• The test set error is increasing
  – The network is overfitting the data
  – Learning idiosyncrasies in the data, not general principles
  – A big problem in machine learning (ANNs in particular)

Avoiding Overfitting

• It is a bad idea to use training set accuracy to decide when to terminate
• One alternative: use a validation set
  – Hold back some of the training set during training
  – Like a miniature test set (not used to train the weights at all)
  – If the validation set error stops decreasing, but the training set error continues decreasing
    - Then it's likely that overfitting has started to occur, so stop
  – Be careful, because the validation set error could get into a local minimum itself
    - It is worthwhile running the training for longer, to wait and see
• Another alternative: use a weight decay factor
  – Take a small amount off every weight after each epoch
  – Networks with smaller weights aren't as highly fine-tuned (overfit)

Suitable Problems for ANNs

• The examples and the target categorisation can be expressed as real values
  – ANNs are just fancy numerical functions
• Predictive accuracy is more important than understanding what the machine has learned
  – A black-box, non-symbolic approach, not easy to digest
• Slow training times are acceptable
  – It can take hours or days to train a network
• Execution of the learned function must be quick
  – Learned networks can categorise very quickly
    - Very useful in time-critical situations (is that a tank, a car or an old lady?)
• The data may contain errors: ANNs are fairly robust to noise
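As a final sketch tying the momentum idea to the weight updates above, here is a minimal Python fragment, not from the original slides, in which each weight remembers its change from the previous update and adds on a fraction α of it. It reuses eta, w_input_to_hidden and w_hidden_to_output from the earlier sketches; the value of α and the function name are purely illustrative.

alpha = 0.9   # momentum parameter, between 0 and 1 (illustrative value)

# Remember the previous change for every weight; all start at zero.
prev_in_hid = {(i, h): 0.0 for h in w_input_to_hidden for i in range(3)}
prev_hid_out = {(h, o): 0.0 for o in w_hidden_to_output for h in w_input_to_hidden}

def update_with_momentum(example, hidden, delta_hidden, delta_out):
    # Input -> hidden: change = eta * delta_Hj * x_i + alpha * (previous change)
    for h, weights in w_input_to_hidden.items():
        for i, x in enumerate(example):
            change = eta * delta_hidden[h] * x + alpha * prev_in_hid[(i, h)]
            weights[i] += change
            prev_in_hid[(i, h)] = change      # remembered for the next epoch
    # Hidden -> output: change = eta * delta_Oj * h_i(E) + alpha * (previous change)
    for o, weights in w_hidden_to_output.items():
        for h in hidden:
            change = eta * delta_out[o] * hidden[h] + alpha * prev_hid_out[(h, o)]
            weights[h] += change
            prev_hid_out[(h, o)] = change     # remembered for the next epoch

With α = 0 this reduces to the plain update used in the worked example; as α grows, successive changes in the same direction compound, which is the behaviour described in the momentum slides above.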