What Are Decision Trees in Data Mining?

Dr. Benjamin Clark, United States, Teacher
Published: 21-07-2017
CSE 473, Chapter 18: Decision Trees and Ensemble Learning

Recall: Learning Decision Trees

Example: when should I wait for a table at a restaurant? Attributes (features) relevant to the Wait? decision:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60 minutes)

Example Decision Tree

[Figure: a decision tree for Wait? based on personal "rules of thumb"; this tree was used to generate the input data.]

Input Data for Learning
• Past examples where I did or did not wait for a table.
• The classification of each example is positive (T) or negative (F).

Decision Tree Learning
• Aim: find a small tree consistent with the training examples.
• Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree.

Choosing an attribute to split on
• Idea: a good attribute should reduce uncertainty, i.e., split the examples into subsets that are (ideally) "all positive" or "all negative".
• Patrons? is a better choice than Type?: after splitting on Type, to wait or not to wait is still at 50% in every branch.

How do we quantify uncertainty?

Using information theory to quantify uncertainty:
• Entropy measures the amount of uncertainty in a probability distribution.
• The entropy (or information content) of an answer to a question with possible answers v_1, …, v_n is

I(P(v_1), …, P(v_n)) = - Σ_{i=1..n} P(v_i) log2 P(v_i)

Using information theory
• Imagine we have p examples with Wait = true (positive) and n examples with Wait = false (negative).
• Our best estimate of the probabilities of Wait = true or false is

P(true) ≈ p / (p + n),  P(false) ≈ n / (p + n)

• Hence the entropy of Wait is given by

I(p/(p+n), n/(p+n)) = - (p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

[Figure: entropy of Wait as a function of P(Wait = T). Entropy is highest (1 bit) when uncertainty is greatest, at P(Wait = T) = 0.5, and falls to 0 as P(Wait = T) approaches 0 or 1.]

Choosing an attribute to split on (continued)
• Idea: a good attribute should reduce uncertainty and result in a "gain in information".
• How much information do we gain if we disclose the value of some attribute?
• Answer: uncertainty before - uncertainty after.

Back at the Restaurant

Before choosing an attribute:
Entropy = - (6/12) log2(6/12) - (6/12) log2(6/12) = - log2(1/2) = log2(2) = 1 bit
There is "1 bit of information to be discovered."

If we choose Type: along branch "French" we have entropy = 1 bit, and similarly for the other branches, so the information gain is 1 - 1 = 0 along every branch.

If we choose Patrons: in branches "None" and "Some", entropy = 0; for "Full", entropy = - (2/6) log2(2/6) - (4/6) log2(4/6) = 0.92. The information gain is (1 - 0) or (1 - 0.92) bits, greater than 0 in both cases. So choosing Patrons gains more information.
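These entropy figures are easy to reproduce. The short Python sketch below is my own illustration rather than anything from the original slides; it computes the two-class entropy I(p/(p+n), n/(p+n)) in bits and checks the 1-bit, 0-bit, and 0.92-bit values above, along with the roughly 0.54-bit information gain for Patrons derived in the next section.

```python
import math

def entropy(p, n):
    """Two-class entropy I(p/(p+n), n/(p+n)) in bits."""
    total = p + n
    bits = 0.0
    for count in (p, n):
        if count > 0:
            q = count / total
            bits -= q * math.log2(q)
    return bits

# Whole training set: 6 positive (Wait = T) and 6 negative (Wait = F) examples.
print(entropy(6, 6))                                   # 1.0 bit "to be discovered"

# Patrons? splits the 12 examples into None (0+/2-), Some (4+/0-), Full (2+/4-).
branches = [(0, 2), (4, 0), (2, 4)]
print([round(entropy(p, n), 3) for p, n in branches])  # [0.0, 0.0, 0.918]

# Weighted average entropy after the split, and the resulting information gain.
avg = sum((p + n) / 12 * entropy(p, n) for p, n in branches)
print(round(1 - avg, 3))                               # ~0.541 bits gained by Patrons
```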
Entropy across branches
• How do we combine the entropy of different branches?
• Answer: compute the average entropy, weighting the entropy of each branch by the probability of entering that branch. In the restaurant data, 2 of the 12 examples enter "None", so the weight for "None" is 2/12 = 1/6; "Some" has weight 4/12 = 1/3; "Full" has weight 6/12 = 1/2.

AvgEntropy(A) = Σ_{i=1..n} (p_i + n_i)/(p + n) · I(p_i/(p_i + n_i), n_i/(p_i + n_i))

i.e., the entropy of each branch times the weight of that branch.

Information gain
• Information gain (IG), or reduction in entropy, from using attribute A:
IG(A) = entropy before - AvgEntropy after choosing A
• Choose the attribute with the largest IG.

Information gain in our example

IG(Patrons) = 1 - [ (2/12) I(0, 1) + (4/12) I(1, 0) + (6/12) I(2/6, 4/6) ] ≈ 0.541 bits
IG(Type) = 1 - [ (2/12) I(1/2, 1/2) + (2/12) I(1/2, 1/2) + (4/12) I(2/4, 2/4) + (4/12) I(2/4, 2/4) ] = 0 bits

Patrons has the highest IG of all attributes ⇒ it is chosen by the DTL algorithm as the root.

Should I stay or should I go? The Learned Decision Tree
• The decision tree learned from the 12 examples is substantially simpler than the original "rules of thumb" tree: a more complex hypothesis is not justified by the small amount of data.

Performance Measurement
• How do we know that the learned tree h ≈ f? Answer: try h on a new test set of examples.
• Learning curve = % correct on the test set as a function of training-set size.

Ensemble Learning
• Sometimes each learning technique yields a different hypothesis (or function), but no perfect hypothesis.
• Could we combine several imperfect hypotheses to get a better hypothesis?

Example 1: Othello Project
• Are many brains better than one?

Example 2: Combining 3 linear classifiers
• Combining 3 linear classifiers ⇒ a more complex classifier. [Figure: each line is one simple classifier, saying that everything to its left is + and everything to its right is -.]

Ensemble Learning: Motivation
• Analogies: elections combine voters' choices to pick a good candidate (hopefully); committees combine experts' opinions to make better decisions; students work together on the Othello project.
• Intuitions: individuals make mistakes, but the "majority" may be less likely to (true for Othello? We shall see…); individuals often have partial knowledge, and a committee can pool expertise to make better decisions.

Technique 1: Bagging
• Combine hypotheses via majority voting.

Bagging: Analysis
• The error probability went down from 0.1 to 0.01.
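That 0.1 to 0.01 drop is what a simple independence argument predicts: a plain majority vote over several hypotheses that each err independently with probability 0.1 is wrong only when more than half of them are wrong. The sketch below is my own check of that arithmetic; the choice of five hypotheses is an assumption of mine, not something stated on the slide, but with five the number comes out close to the quoted 0.01.

```python
from math import comb

def majority_vote_error(k, p):
    """P(the majority of k independent classifiers is wrong), each erring with prob. p."""
    wrong_needed = k // 2 + 1        # wrong votes needed to flip the majority (k odd)
    return sum(comb(k, m) * p**m * (1 - p)**(k - m) for m in range(wrong_needed, k + 1))

# Assumed setting: 5 independent hypotheses, each with error probability 0.1.
print(majority_vote_error(5, 0.1))   # ~0.0086, roughly the 0.01 quoted above
```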
Weighted Majority Voting
• In practice, hypotheses are rarely independent, and some hypotheses make fewer errors than others ⇒ not all votes should count equally.
• Idea: take a weighted majority.

Technique 2: Boosting
• The most popular ensemble learning technique: it computes a weighted majority of hypotheses and can "boost" the performance of a "weak learner".
• It operates on a weighted training set: each training example (instance) has a "weight", and the learning algorithm takes the weight of each input into account.
• Idea: when an input is misclassified by a hypothesis, increase its weight so that the next hypothesis is more likely to classify it correctly.

Boosting Example with Decision Trees (DTs)

[Figure: decision trees from successive boosting rounds. Annotations mark a training case that is correctly classified, a training case that has a large weight in this round, and a DT that has a strong vote.]

The output of h_final is the weighted majority of the outputs of h_1, …, h_4.

AdaBoost Algorithm

[Figure: pseudocode of the AdaBoost algorithm; a minimal code sketch is given at the end of these notes.]

AdaBoost Example
• Original training set D_1: equal weights on all training inputs.
• Goal: in round t, learn a classifier h_t that minimizes error with respect to the weighted training set; h_t maps an input to True (+1) or False (-1).
(Example taken from "A Tutorial on Boosting" by Yoav Freund and Rob Schapire.)

• Round 1: the inputs misclassified by h_1 get increased weights; z_1 = 0.42.
• Round 2: z_2 = 0.65.
• Round 3: z_3 = 0.92.
• The final hypothesis h_final takes the weighted majority of h_1, h_2, h_3 using the weights z_1, z_2, z_3, where sign(x) = +1 if x ≥ 0 and -1 otherwise.

Next Time
• Classification using: Nearest Neighbors, Neural Networks.
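To make the boosting procedure above concrete, here is a minimal sketch of AdaBoost with axis-aligned threshold stumps as the weak learner. It is my own illustration, not code from the course; the function names are invented, and it assumes the standard hypothesis weight alpha_t = 0.5 * ln((1 - eps_t)/eps_t), which is consistent with the round weights z_1 = 0.42, z_2 = 0.65, z_3 = 0.92 quoted from the Freund and Schapire tutorial.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """A decision stump: predict +1 on one side of a threshold, -1 on the other."""
    return polarity * np.where(X[:, feature] >= threshold, 1, -1)

def fit_stump(X, y, w):
    """Pick the stump (feature, threshold, polarity) with the lowest weighted error."""
    best, best_err = None, np.inf
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = np.sum(w[pred != y])
                if err < best_err:
                    best_err, best = err, (feature, threshold, polarity)
    return best, best_err

def adaboost(X, y, rounds=3):
    """AdaBoost with decision stumps; y must be a numpy array of labels in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # D_1: equal weights on all inputs
    ensemble = []                              # list of (alpha_t, stump parameters)
    for _ in range(rounds):
        stump, err = fit_stump(X, y, w)        # weak hypothesis h_t on weighted data
        err = max(err, 1e-10)                  # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # hypothesis weight (z_t on the slides)
        pred = stump_predict(X, *stump)
        w *= np.exp(-alpha * y * pred)         # misclassified inputs gain weight
        w /= w.sum()                           # renormalize to a distribution
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    """h_final: sign of the alpha-weighted sum of the weak hypotheses' outputs."""
    total = sum(alpha * stump_predict(X, *stump) for alpha, stump in ensemble)
    return np.where(total >= 0, 1, -1)         # sign(x) = +1 if x >= 0, -1 otherwise
```

Each round reweights the training set so that the examples the current stump gets wrong count for more in the next round, which is exactly the idea stated above; h_final is then the sign of the alpha-weighted vote, i.e., a weighted majority of the weak hypotheses.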
