Decision Trees (CS446, Spring 2017) - lecture slides on decision tree induction

Decision Trees

- A hierarchical data structure that represents data by implementing a divide-and-conquer strategy.
- Can be used as a non-parametric classification and regression method.
- Given a collection of examples, learn a decision tree that represents it, then use this representation to classify new examples.

The Representation

- Decision trees are classifiers for instances represented as feature vectors, e.g. (color = ...; shape = ...; label = ...).
- Nodes are tests on feature values; there is one branch for each value of the feature.
- Leaves specify the categories (labels).
- Example instance: (color = RED; shape = triangle).
- A decision tree can categorize instances into multiple disjoint categories.

[Figure: evaluating and learning a decision tree that splits on Color (blue, red, green) and then on Shape (triangle, circle, square), with leaf categories A, B, C.]

Expressivity of Decision Trees

- As Boolean functions, decision trees can represent any Boolean function.
- A tree can be rewritten as a set of rules in Disjunctive Normal Form (DNF):
  - green AND square -> positive
  - blue AND circle -> positive
  - blue AND square -> positive
- The disjunction of these rules is equivalent to the decision tree.
- What did we show? What is the hypothesis space here? With 2 dimensions of 3 values each: |X| = 9, |Y| = 2, and |H| = 2^9.

[Figure: the corresponding tree, splitting on Color and then Shape, with +/- labels at the leaves.]

Decision Trees (properties)

- The output is a discrete category; real-valued outputs are also possible (regression trees).
- There are efficient algorithms for processing large amounts of data (but not too many features).
- There are methods for handling noisy data (classification noise and attribute noise) and for handling missing attribute values.

Decision Boundaries

- Usually, instances are represented as attribute-value pairs, e.g. (color = blue, shape = square, +).
- Numerical values can be used either by discretizing them or by using thresholds for splitting nodes.
- In that case, the tree divides the feature space into axis-parallel rectangles, each labeled with one of the labels.

[Figure: a tree with threshold tests X > 3, Y > 7, and Y > 5, together with the corresponding axis-parallel partition of the (X, Y) plane into regions labeled + and -.]

Decision Trees (learning)

- Decision trees can represent any Boolean function and can be viewed as a way to compactly represent a lot of data.
- They are a natural representation ("twenty questions").
- The evaluation of a decision tree classifier is easy.
- Clearly, given data, there are many ways to represent it as a decision tree; learning a good representation from the data is the challenge.

[Figure: a tree on the play-tennis data: Outlook (Sunny, Overcast, Rain), with the Sunny branch testing Humidity (High, Normal), the Rain branch testing Wind (Strong, Weak), and Yes/No leaves.]

Will I play tennis today?

Features:
- Outlook: Sunny, Overcast, Rain
- Temperature: Hot, Mild, Cool
- Humidity: High, Normal, Low
- Wind: Strong, Weak

Labels:
- Binary classification task: Y = {+, -}

Data (Outlook: S(unny), O(vercast), R(ainy); Temperature: H(ot), M(edium), C(ool); Humidity: H(igh), N(ormal), L(ow); Wind: S(trong), W(eak)):

 #   O  T  H  W  Play?
 1   S  H  H  W   -
 2   S  H  H  S   -
 3   O  H  H  W   +
 4   R  M  H  W   +
 5   R  C  N  W   +
 6   R  C  N  S   -
 7   O  C  N  S   +
 8   S  M  H  W   -
 9   S  C  N  W   +
 10  R  M  N  W   +
 11  S  M  N  S   +
 12  O  M  H  S   +
 13  O  H  N  W   +
 14  R  M  H  S   -
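Since the slides stress that evaluating a decision tree classifier is easy, here is a minimal Python sketch (not from the lecture) that hard-codes the play-tennis tree from the figure above and applies it to rows of this table. The dictionary encoding and the function name are illustrative assumptions; the leaf labels were read off so as to agree with the data above.

```python
# Minimal sketch: evaluating the Outlook/Humidity/Wind tree from the figure above.
# The tree structure comes from the slides; the dict-based encoding and the
# function name `classify` are illustrative choices, not part of the lecture.

def classify(example):
    """Walk the fixed play-tennis tree and return 'Yes' or 'No'."""
    if example["Outlook"] == "Overcast":
        return "Yes"
    elif example["Outlook"] == "Sunny":
        # Sunny days are decided by humidity.
        return "No" if example["Humidity"] == "High" else "Yes"
    else:  # Rain
        # Rainy days are decided by wind.
        return "No" if example["Wind"] == "Strong" else "Yes"

if __name__ == "__main__":
    # Row 1 of the table (Sunny, Hot, High, Weak) -> the tree predicts "No" (label '-').
    print(classify({"Outlook": "Sunny", "Temperature": "Hot",
                    "Humidity": "High", "Wind": "Weak"}))
```

Evaluation is just a sequence of attribute tests from the root to a leaf; learning is the question of how such a tree is built from the table, which the next slides address.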
Administration

- Registration: ask questions NOW.
- HW1 is out and is due on Friday. You should be working on it already; the goal of the homework is to teach you something.
- Discussion sessions will start next week; come ready with questions.
- Projects: small (2-3 person) groups, your choice of topic. Anything with a significant machine learning component works; more details will come. The project is 25% of the grade and needs to be substantial. Extra credit for undergrads.
- Quiz 2 will be made available over the weekend.
- Check the website for office hours, discussion sessions, etc.

Basic Decision Tree Learning Algorithm

- Data is processed in batch (i.e., all the data is available).
- Recursively build a decision tree top-down.

[Figure: the play-tennis data table next to the learned tree: Outlook (Sunny, Overcast, Rain), with the Sunny branch testing Humidity, the Rain branch testing Wind, and Yes/No leaves.]

Basic Decision Tree Algorithm (ID3)

Let S be the set of examples, Label the target attribute (the prediction), and Attributes the set of measured attributes.

  ID3(S, Attributes, Label):
    If all examples in S have the same label, return a single-node tree with that label.
    Otherwise:
      Let A be the attribute in Attributes that best classifies S, and create a root node that tests A.
      For each possible value v of A:
        Add a new tree branch corresponding to A = v.
        Let Sv be the subset of examples in S with A = v.
        If Sv is empty, add a leaf node with the most common value of Label in S
          (why? because we need a prediction there at evaluation time).
        Else, below this branch add the subtree ID3(Sv, Attributes - {A}, Label).
      Return the root node.

Picking the Root Attribute

- The goal is to have the resulting decision tree be as small as possible (Occam's Razor).
- But finding the minimal decision tree consistent with the data is NP-hard.
- The recursive algorithm is a greedy heuristic search for a simple tree and cannot guarantee optimality.
- The main decision in the algorithm is the selection of the next attribute to condition on.

Example: consider data with two Boolean attributes (A, B):
- (A=0, B=0), -: 50 examples
- (A=0, B=1), -: 50 examples
- (A=1, B=0), -: 0 examples
- (A=1, B=1), +: 100 examples

What should be the first attribute we select?
- Splitting on A, we get purely labeled nodes: A=0 is all negative and A=1 is all positive.
- Splitting on B, we do not get purely labeled nodes.
- What if instead we have (A=1, B=0), -: 3 examples?

With that change:
- (A=0, B=0), -: 50 examples
- (A=0, B=1), -: 50 examples
- (A=1, B=0), -: 3 examples
- (A=1, B=1), +: 100 examples

The two candidate trees look structurally similar; which attribute should we choose?
- Splitting on A first: the A=0 branch holds 100 examples, all negative; the A=1 branch holds 103 examples (3 negative, 100 positive) and still needs a split on B.
- Splitting on B first: the B=0 branch holds 53 examples, all negative; the B=1 branch holds 150 examples (50 negative, 100 positive) and still needs a split on A.
- Advantage to A. But we need a way to quantify this.

We want attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node. The most popular heuristic is based on information gain, and originated with Quinlan's ID3 system.
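Here is a small Python sketch of the ID3 recursion in the pseudocode above. The data representation (a list of dicts, a dict of possible attribute values, a label key) and the helper names are illustrative assumptions rather than the lecture's code; the attribute-selection heuristic is passed in as `choose_attribute`, anticipating the information-gain criterion discussed next.

```python
# A minimal sketch of the ID3 recursion, assuming examples are dicts and
# `values` maps each attribute to its list of possible values. `choose_attribute`
# stands in for the attribute-selection heuristic (e.g., information gain).
from collections import Counter

def id3(examples, attributes, values, label, choose_attribute):
    labels = [ex[label] for ex in examples]
    # If all examples are labeled the same, return a single-node tree with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test: fall back to the most common label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute that best classifies the current examples.
    a = choose_attribute(examples, attributes, label)
    majority = Counter(labels).most_common(1)[0][0]
    tree = {a: {}}
    for v in values[a]:                      # one branch per possible value of A
        subset = [ex for ex in examples if ex[a] == v]
        if not subset:
            # Empty branch: predict the common label of S (needed at evaluation time).
            tree[a][v] = majority
        else:
            rest = [b for b in attributes if b != a]
            tree[a][v] = id3(subset, rest, values, label, choose_attribute)
    return tree
```

With the play-tennis rows encoded as dicts (e.g. {"Outlook": "S", ..., "Play": "+"}) and an information-gain-based `choose_attribute`, calling `id3` performs exactly the top-down, batch construction described above.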
Entropy

- The entropy (impurity, disorder) of a set of examples S, relative to a binary classification, is
    Entropy(S) = -p+ log2(p+) - p- log2(p-),
  where p+ is the proportion of positive examples in S and p- is the proportion of negative examples.
- If all the examples belong to the same category, Entropy = 0; if the examples are equally mixed (0.5, 0.5), Entropy = 1.
- Entropy measures the level of uncertainty: high entropy means high uncertainty, low entropy means little or no uncertainty.
- In general, when pi is the fraction of examples labeled i:
    Entropy(p1, p2, ..., pk) = -sum_{i=1..k} pi log2(pi).
- Entropy can be viewed as the number of bits required, on average, to encode the class labels. If the probability of + is 0.5, a single bit is required per example; if it is 0.8, we can use less than one bit.

[Figure: the entropy of a binary distribution as a function of p+, equal to 0 when p+ is 0 or 1 and peaking at 1 when p+ = 0.5.]

Information Gain

- The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute:
    Gain(S, a) = Entropy(S) - sum over v in values(a) of (|Sv| / |S|) * Entropy(Sv),
  where Sv is the subset of S for which attribute a has value v; the entropy of the partition is computed by weighting the entropy of each part by its size relative to the original set.
- Partitions of low entropy (imbalanced splits) lead to high gain.
- Go back and check which of the A and B splits above is better.

Will I play tennis today?

(This slide repeats the play-tennis data table shown earlier.)
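To make the formulas concrete, here is a small Python sketch (not from the slides) of entropy and information gain, checked on the two-attribute (A, B) example from the root-selection discussion above. The function names and the dict-based data encoding are illustrative assumptions.

```python
# A small sketch of the entropy and information-gain formulas above, applied to
# the (A, B) example with counts 50 / 50 / 3 / 100 from the earlier slides.
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the label proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attribute, label):
    """Gain(S, a) = Entropy(S) - sum_v (|Sv| / |S|) * Entropy(Sv)."""
    total = entropy([ex[label] for ex in examples])
    n = len(examples)
    for v in {ex[attribute] for ex in examples}:
        subset = [ex[label] for ex in examples if ex[attribute] == v]
        total -= (len(subset) / n) * entropy(subset)
    return total

if __name__ == "__main__":
    # (A=0,B=0,-) x50, (A=0,B=1,-) x50, (A=1,B=0,-) x3, (A=1,B=1,+) x100
    data = ([{"A": 0, "B": 0, "y": "-"}] * 50 + [{"A": 0, "B": 1, "y": "-"}] * 50 +
            [{"A": 1, "B": 0, "y": "-"}] * 3 + [{"A": 1, "B": 1, "y": "+"}] * 100)
    print(gain(data, "A", "y"), gain(data, "B", "y"))  # splitting on A gives higher gain
```

A `choose_attribute` that returns the attribute with the largest `gain` plugs directly into the `id3` sketch above, and on this example it quantifies the earlier intuition that splitting on A first is the better choice.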
