Decision Tree Learning
Dr. Benjamin Clark, United States
Published: 21-07-2017
Artificial Intelligence 11. Decision Tree Learning (www.ThesisScientist.com)

What to do this Weekend?
- If my parents are visiting, we'll go to the cinema
- If not:
  - If it's sunny, I'll play tennis
  - If it's windy and I'm rich, I'll go shopping
  - If it's windy and I'm poor, I'll go to the cinema
  - If it's rainy, I'll stay in

Written as a Decision Tree
[Figure: the weekend decision as a tree, with the root of the tree at the top and the category leaves at the bottom]

Using the Decision Tree
[Figure: tracing the tree for "no parents on a sunny day"]

From Decision Trees to Logic
- Decision trees can be written as Horn clauses in first-order logic
- Read from the root to every tip: if this and this and this ... and this, then do this
- In our example: if no_parents and sunny_day, then play_tennis
  - no_parents ∧ sunny_day → play_tennis

Decision Tree Learning Overview
- A decision tree can be seen as a set of rules for performing a categorisation
  - E.g., "what kind of weekend will this be?"
- Remember that we're learning from examples, not turning thought processes into decision trees
- We need examples put into categories
- We also need attributes for the examples
  - Attributes describe examples (background knowledge)
  - Each attribute takes only a finite set of values

The ID3 Algorithm - Overview
- The major question in decision tree learning: which nodes to put in which positions, including the root node and the leaf nodes
- ID3 uses a measure called information gain
  - Based on a notion of entropy, the "impurity in the data"
  - Used to choose which node to put in next
- The node with the highest information gain is chosen
  - When there are no choices left, a leaf node is put on

Entropy - General Idea
- From Tom Mitchell's book: "In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples"
- We want a notion of impurity in the data
- Imagine a set of boxes with balls in them
- If all the balls are in one box, the data is nicely ordered, so it scores low for entropy
- Calculate entropy by summing over all boxes
  - A box with very few of the examples in it scores low
  - A box with almost all of the examples in it also scores low

Entropy - Formulae
- Given a set of examples, S
- For examples in a binary categorisation:
  Entropy(S) = -p+ log2(p+) - p- log2(p-)
  - where p+ is the proportion of positives and p- is the proportion of negatives
- For examples in categorisations c1 to cn:
  Entropy(S) = - Σ (i = 1..n) p_i log2(p_i)
  - where p_i is the proportion of examples in c_i

Entropy - Explanation
- Each category adds its own term to the whole measure
- When p_i is near to 1, (nearly) all the examples are in this category
  - So it should score low for its part of the entropy
  - log2(p_i) gets closer and closer to 0, and this factor dominates the term
  - So the term -p_i log2(p_i) comes to nearly 0 (which is good)
- When p_i is near to 0, (very) few examples are in this category
  - So it should score low for its part of the entropy
  - log2(p_i) gets larger in magnitude (more negative), but it does not dominate: the factor p_i shrinks faster
  - Hence the term -p_i log2(p_i) again comes to nearly 0 (which is good)

Information Gain
- Given a set of examples S and an attribute A
  - A can take values v1 ... vm
  - Let S_v = the examples which take value v for attribute A
- Calculate:
  Gain(S, A) = Entropy(S) - Σ (v in v1..vm) (|S_v|/|S|) Entropy(S_v)
  - This estimates the reduction in entropy we get if we know the value of attribute A for the examples in S

An Example Calculation of Information Gain
- Suppose we have a set of examples S = {s1, s2, s3, s4}
  - In a binary categorisation, with one positive example and three negative examples
  - The positive example is s1
- And an attribute A which takes values v1, v2, v3
  - s1 takes value v2 for A, s2 takes value v2 for A
  - s3 takes value v3 for A, s4 takes value v1 for A

First Calculate Entropy(S)
- Recall that Entropy(S) = -p+ log2(p+) - p- log2(p-)
- From the binary
categorisation, we know that p+ = 1/4 and p- = 3/4
- Hence Entropy(S) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.811
- Note for users of old calculators: you may need to use the fact that log2(x) = ln(x)/ln(2)
- And also note that, by convention, 0·log2(0) is taken to be 0

Calculate Gain for each Value of A
- Remember that Gain(S, A) = Entropy(S) - Σ (|S_v|/|S|) Entropy(S_v)
- And that S_v = the set of examples with value v for A
  - So, S_v1 = {s4}, S_v2 = {s1, s2}, S_v3 = {s3}
- Now, (|S_v1|/|S|) Entropy(S_v1)
  = (1/4) (-(0/1)log2(0/1) - (1/1)log2(1/1))
  = (1/4) (0 - (1)log2(1)) = (1/4)(0 - 0) = 0
- Similarly, (|S_v2|/|S|) Entropy(S_v2) = (2/4)(1) = 0.5 and (|S_v3|/|S|) Entropy(S_v3) = 0

Final Calculation
- We add up the three terms and subtract them from the overall entropy of S
- Final answer for information gain: Gain(S, A) = 0.811 - (0 + 0.5 + 0) = 0.311

The ID3 Algorithm
- Given a set of examples, S
  - Described by a set of attributes A_i
  - Categorised into categories c_j
1. Choose the root node to be the attribute A such that A scores highest for information gain relative to S, i.e., Gain(S, A) is the highest over all attributes
2.
For each value v that A can take:
- Draw a branch and label it with the corresponding v
- Then see the options below

The ID3 Algorithm (continued)
- For each branch you've just drawn (for value v):
  - If S_v only contains examples in category c, then put that category as a leaf node in the tree
  - If S_v is empty, then find the default category (the one which contains the most examples from S), and put this default category as a leaf node in the tree
  - Otherwise:
    - Remove A from the attributes which can be put into nodes
    - Replace S with S_v
    - Find the new attribute A scoring best for Gain(S, A)
    - Start again at step 2 (making sure you have replaced S with S_v)

Explanatory Diagram
[Figure: a diagram of the ID3 branching step]

A Worked Example

Weekend   Weather   Parents   Money   Decision (Category)
W1        Sunny     Yes       Rich    Cinema
W2        Sunny     No        Rich    Tennis
W3        Windy     Yes       Rich    Cinema
W4        Rainy     Yes       Poor    Cinema
W5        Rainy     No        Rich    Stay in
W6        Rainy     Yes       Poor    Cinema
W7        Windy     No        Poor    Cinema
W8        Windy     No        Rich    Shopping
W9        Windy     Yes       Rich    Cinema
W10       Sunny     No        Rich    Tennis

Information Gain for All of S
- S = {W1, W2, ..., W10}
- Firstly, we need to calculate Entropy(S) = ... = 1.571 (see notes)
- Next, we need to calculate the information gain for all the attributes we currently have available (which is all of them at the moment):
  - Gain(S, weather) = ... = 0.7
  - Gain(S, parents) = ... = 0.61
  - Gain(S, money) = ... = 0.2816
- Hence, the weather is the first attribute to split on, because this gives us the biggest information gain
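The entropy and information-gain calculations can be sketched in Python. This is a minimal illustration of the formulae, not code from the slides; the function names and the encoding of the four-example set (dictionaries with an "A" and a "cat" key) are our own choices:

```python
import math

def entropy(labels):
    """Entropy of a list of category labels; 0*log2(0) is taken as 0."""
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in props if p > 0)

def gain(examples, attr, label):
    """Gain(S, A) = Entropy(S) - sum over v of (|S_v|/|S|) * Entropy(S_v)."""
    total = entropy([e[label] for e in examples])
    for v in set(e[attr] for e in examples):
        s_v = [e[label] for e in examples if e[attr] == v]
        total -= (len(s_v) / len(examples)) * entropy(s_v)
    return total

# The four-example illustration: s1 is the positive example;
# A takes value v2 for s1 and s2, v3 for s3, and v1 for s4.
S = [{"A": "v2", "cat": "+"},
     {"A": "v2", "cat": "-"},
     {"A": "v3", "cat": "-"},
     {"A": "v1", "cat": "-"}]
print(round(entropy([e["cat"] for e in S]), 3))  # 0.811
print(round(gain(S, "A", "cat"), 3))             # 0.311
```

Running it reproduces the two numbers from the worked example: Entropy(S) = 0.811 and Gain(S, A) = 0.311.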
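The full weekend example can likewise be checked end to end with a compact ID3 sketch. Again this is an illustrative implementation, not the slides' own code; for simplicity it omits the empty-S_v default-category case, since in this dataset every branch it draws contains at least one example:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    total = entropy([r["decision"] for r in rows])
    for v in set(r[attr] for r in rows):
        subset = [r["decision"] for r in rows if r[attr] == v]
        total -= (len(subset) / len(rows)) * entropy(subset)
    return total

def id3(rows, attrs):
    cats = [r["decision"] for r in rows]
    if len(set(cats)) == 1:      # all examples in one category: leaf node
        return cats[0]
    if not attrs:                # no attributes left: default (majority) category
        return Counter(cats).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attrs if a != best])
                   for v in set(r[best] for r in rows)}}

data = [dict(zip(["weather", "parents", "money", "decision"], row)) for row in [
    ("Sunny", "Yes", "Rich", "Cinema"),   # W1
    ("Sunny", "No",  "Rich", "Tennis"),   # W2
    ("Windy", "Yes", "Rich", "Cinema"),   # W3
    ("Rainy", "Yes", "Poor", "Cinema"),   # W4
    ("Rainy", "No",  "Rich", "Stay in"),  # W5
    ("Rainy", "Yes", "Poor", "Cinema"),   # W6
    ("Windy", "No",  "Poor", "Cinema"),   # W7
    ("Windy", "No",  "Rich", "Shopping"), # W8
    ("Windy", "Yes", "Rich", "Cinema"),   # W9
    ("Sunny", "No",  "Rich", "Tennis"),   # W10
]]

print(round(entropy([r["decision"] for r in data]), 3))  # 1.571
for a in ["weather", "parents", "money"]:
    print(a, round(gain(data, a), 3))  # ≈ 0.7, 0.61, 0.28 (up to rounding)
tree = id3(data, ["weather", "parents", "money"])
print(next(iter(tree)))  # weather is chosen as the root
```

Note that the computed gains (0.695, 0.610, 0.281) match the slides' figures of 0.7, 0.61 and 0.2816 up to rounding, and the root attribute chosen is indeed the weather.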