Decision Tree Algorithms (lecture slides)

Tree-based Methods

- Here we describe tree-based methods for regression and classification.
- These involve stratifying or segmenting the predictor space into a number of simple regions.
- Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision-tree methods.

Pros and Cons

- Tree-based methods are simple and useful for interpretation.
- However, they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy.
- Hence we also discuss bagging, random forests, and boosting. These methods grow multiple trees which are then combined to yield a single consensus prediction.
- Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.

The Basics of Decision Trees

- Decision trees can be applied to both regression and classification problems.
- We first consider regression problems, and then move on to classification.

Baseball salary data: how would you stratify it?

Salary is color-coded from low (blue, green) to high (yellow, red).

[Figure: scatter plot of Hits (0-200) against Years (roughly 1-24) for the Hitters data, with each player's salary color-coded.]

Decision tree for these data

[Figure: regression tree with a first split at Years < 4.5, a second split at Hits < 117.5, and leaf predictions 5.11, 6.00, and 6.74.]

Details of previous figure

- For the Hitters data, a regression tree for predicting the log salary of a baseball player, based on the number of years that he has played in the major leagues and the number of hits that he made in the previous year.
- At a given internal node, the label (of the form $X_j < t_k$) indicates the left-hand branch emanating from that split, and the right-hand branch corresponds to $X_j \geq t_k$. For instance, the split at the top of the tree results in two large branches. The left-hand branch corresponds to Years < 4.5, and the right-hand branch corresponds to Years >= 4.5.
- The tree has two internal nodes and three terminal nodes, or leaves. The number in each leaf is the mean of the response for the observations that fall there.

Results

- Overall, the tree stratifies or segments the players into three regions of predictor space: $R_1 = \{X \mid \text{Years} < 4.5\}$, $R_2 = \{X \mid \text{Years} \geq 4.5,\ \text{Hits} < 117.5\}$, and $R_3 = \{X \mid \text{Years} \geq 4.5,\ \text{Hits} \geq 117.5\}$.

[Figure: the Years-Hits plane cut into the three regions $R_1$, $R_2$, $R_3$ by the lines Years = 4.5 and Hits = 117.5.]

Terminology for Trees

- In keeping with the tree analogy, the regions $R_1$, $R_2$, and $R_3$ are known as terminal nodes.
- Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree.
- The points along the tree where the predictor space is split are referred to as internal nodes.
- In the Hitters tree, the two internal nodes are indicated by the text Years < 4.5 and Hits < 117.5.

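To make the prediction rule concrete, here is a minimal Python sketch (not part of the slides) of the Hitters tree shown above: the cutpoints 4.5 and 117.5 and the leaf values 5.11, 6.00, and 6.74 are read off the figure, and the function simply routes a player to the region $R_1$, $R_2$, or $R_3$ he falls into and returns that region's mean log salary.

```python
def predict_log_salary(years: float, hits: float) -> float:
    """Route a player through the two internal nodes of the Hitters tree
    and return the mean log salary of the leaf (region) he falls into."""
    if years < 4.5:        # region R1: Years < 4.5
        return 5.11
    if hits < 117.5:       # region R2: Years >= 4.5, Hits < 117.5
        return 6.00
    return 6.74            # region R3: Years >= 4.5, Hits >= 117.5

# A long-tenured player with many hits lands in R3.
print(predict_log_salary(years=7, hits=150))   # 6.74
```
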
Interpretation of Results

- Years is the most important factor in determining Salary, and players with less experience earn lower salaries than more experienced players.
- Given that a player is less experienced, the number of Hits that he made in the previous year seems to play little role in his Salary.
- But among players who have been in the major leagues for five or more years, the number of Hits made in the previous year does affect Salary, and players who made more Hits last year tend to have higher salaries.
- Surely an over-simplification, but compared to a regression model, it is easy to display, interpret, and explain.

Details of the tree-building process

1. We divide the predictor space, that is, the set of possible values for $X_1, X_2, \dots, X_p$, into J distinct and non-overlapping regions, $R_1, R_2, \dots, R_J$.
2. For every observation that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$.

More details of the tree-building process

- In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model.
- The goal is to find boxes $R_1, \dots, R_J$ that minimize the RSS, given by
  \[
  \sum_{j=1}^{J} \sum_{i \in R_j} \bigl( y_i - \hat{y}_{R_j} \bigr)^2 ,
  \]
  where $\hat{y}_{R_j}$ is the mean response for the training observations within the jth box.
- Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes.
- For this reason, we take a top-down, greedy approach that is known as recursive binary splitting.
- The approach is top-down because it begins at the top of the tree and then successively splits the predictor space; each split is indicated via two new branches further down on the tree.
- It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

Details Continued

- We first select the predictor $X_j$ and the cutpoint s such that splitting the predictor space into the regions $\{X \mid X_j < s\}$ and $\{X \mid X_j \geq s\}$ leads to the greatest possible reduction in RSS.
- Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions.
- However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions.
- Again, we look to split one of these three regions further, so as to minimize the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.

Predictions

- We predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs.
- A five-region example of this approach is shown in the next figure.

[Figure: four panels illustrating a five-region example, with cutpoints $t_1, \dots, t_4$ on predictors $X_1$ and $X_2$.]

Details of previous figure

Top Left: A partition of two-dimensional feature space that could not result from recursive binary splitting.
Top Right: The output of recursive binary splitting on a two-dimensional example.
Bottom Left: A tree corresponding to the partition in the top right panel.
Bottom Right: A perspective plot of the prediction surface corresponding to that tree.

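The greedy step of recursive binary splitting described above can be sketched in a few lines of Python. This is an illustrative implementation, not the slides' own code; the choice of candidate cutpoints (midpoints between consecutive distinct values) and the toy data are assumptions made for the example.

```python
import numpy as np

def best_split(X, y):
    """One greedy step of recursive binary splitting: scan every predictor j
    and candidate cutpoint s, and return the (j, s, RSS) triple for which the
    split {X_j < s} / {X_j >= s} gives the smallest total RSS."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate cutpoints: midpoints between consecutive distinct values.
        for s in (values[:-1] + values[1:]) / 2.0:
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

# Toy usage: y steps at X_1 = 0.5, so the recovered cutpoint should be near 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] < 0.5, 5.0, 6.5) + rng.normal(scale=0.2, size=200)
print(best_split(X, y))
```

Growing a full tree amounts to applying this search recursively to each resulting region until the stopping rule (for example, at most five observations per region) is met.
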
Pruning a tree

- The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance. Why?
- A smaller tree with fewer splits (that is, fewer regions $R_1, \dots, R_J$) might lead to lower variance and better interpretation at the cost of a little bias.
- One possible alternative to the process described above is to grow the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold.
- This strategy will result in smaller trees, but is too short-sighted: a seemingly worthless split early on in the tree might be followed by a very good split, that is, a split that leads to a large reduction in RSS later on.

Pruning a tree continued

- A better strategy is to grow a very large tree $T_0$, and then prune it back in order to obtain a subtree.
- Cost complexity pruning, also known as weakest link pruning, is used to do this.
- We consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$. For each value of $\alpha$ there corresponds a subtree $T \subset T_0$ such that
  \[
  \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \bigl( y_i - \hat{y}_{R_m} \bigr)^2 + \alpha |T|
  \]
  is as small as possible. Here $|T|$ indicates the number of terminal nodes of the tree $T$, $R_m$ is the rectangle (i.e. the subset of predictor space) corresponding to the mth terminal node, and $\hat{y}_{R_m}$ is the mean of the training observations in $R_m$.

Choosing the best subtree

- The tuning parameter $\alpha$ controls a trade-off between the subtree's complexity and its fit to the training data.
- We select an optimal value $\hat{\alpha}$ using cross-validation.
- We then return to the full data set and obtain the subtree corresponding to $\hat{\alpha}$.

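As a rough sketch of this recipe with off-the-shelf tools, scikit-learn's DecisionTreeRegressor implements minimal cost-complexity pruning via cost_complexity_pruning_path and the ccp_alpha parameter. The snippet below is my own example, with synthetic data standing in for a training set such as Hitters; it picks $\alpha$ by cross-validation and refits the pruned subtree on the full data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Toy data standing in for the predictors and (log) response.
rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 3))
y = np.where(X[:, 0] < 0.5, 5.0, 6.5) + rng.normal(scale=0.3, size=300)

# Grow a large tree T0 and ask for the sequence of alpha values produced
# by weakest-link (cost-complexity) pruning.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Score each candidate alpha by 5-fold cross-validation (negative MSE).
scores = [
    cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a),
                    X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]

# Refit on the full data with the chosen alpha to obtain the pruned subtree.
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, pruned.get_n_leaves())
```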
