Classification decision trees

4 Classification: Basic Concepts, Decision Trees, and Model Evaluation

Classification, which is the task of assigning objects to one of several predefined categories, is a pervasive problem that encompasses many diverse applications. Examples include detecting spam email messages based upon the message header and content, categorizing cells as malignant or benign based upon the results of MRI scans, and classifying galaxies based upon their shapes (see Figure 4.1).

[Figure 4.1. Classification of galaxies: (a) a spiral galaxy; (b) an elliptical galaxy. The images are from the NASA website.]

[Figure 4.2. Classification as the task of mapping an input attribute set x into its class label y.]

This chapter introduces the basic concepts of classification, describes some of the key issues such as model overfitting, and presents methods for evaluating and comparing the performance of a classification technique. While it focuses mainly on a technique known as decision tree induction, most of the discussion in this chapter is also applicable to other classification techniques, many of which are covered in Chapter 5.

4.1 Preliminaries

The input data for a classification task is a collection of records. Each record, also known as an instance or example, is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute, designated as the class label (also known as the category or target attribute). Table 4.1 shows a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian. The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability to fly, and ability to live in water. Although the attributes presented in Table 4.1 are mostly discrete, the attribute set can also contain continuous features. The class label, on the other hand, must be a discrete attribute. This is a key characteristic that distinguishes classification from regression, a predictive modeling task in which y is a continuous attribute. Regression techniques are covered in Appendix D.

Definition 4.1 (Classification). Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.

The target function is also known informally as a classification model. A classification model is useful for the following purposes.

Descriptive Modeling. A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example, it would be useful, for both biologists and others, to have a descriptive model that
summarizes the data shown in Table 4.1 and explains what features define a vertebrate as a mammal, reptile, bird, fish, or amphibian.

Table 4.1. The vertebrate data set.
Name           Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates  Class Label
human          warm-blooded      hair        yes          no                no               yes       no          mammal
python         cold-blooded      scales      no           no                no               no        yes         reptile
salmon         cold-blooded      scales      no           yes               no               no        no          fish
whale          warm-blooded      hair        yes          yes               no               no        no          mammal
frog           cold-blooded      none        no           semi              no               yes       yes         amphibian
komodo dragon  cold-blooded      scales      no           no                no               yes       no          reptile
bat            warm-blooded      hair        yes          no                yes              yes       yes         mammal
pigeon         warm-blooded      feathers    no           no                yes              yes       no          bird
cat            warm-blooded      fur         yes          no                no               yes       no          mammal
leopard shark  cold-blooded      scales      yes          yes               no               no        no          fish
turtle         cold-blooded      scales      no           semi              no               yes       no          reptile
penguin        warm-blooded      feathers    no           semi              no               yes       no          bird
porcupine      warm-blooded      quills      yes          no                no               yes       yes         mammal
eel            cold-blooded      scales      no           yes               no               no        no          fish
salamander     cold-blooded      none        no           semi              no               yes       yes         amphibian

Predictive Modeling. A classification model can also be used to predict the class label of unknown records. As shown in Figure 4.2, a classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record. Suppose we are given the following characteristics of a creature known as a gila monster:

Name          Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates  Class Label
gila monster  cold-blooded      scales      no           no                no               yes       yes         ?

We can use a classification model built from the data set shown in Table 4.1 to determine the class to which the creature belongs.

Classification techniques are most suited for predicting or describing data sets with binary or nominal categories. They are less effective for ordinal categories (e.g., classifying a person as a member of a high-, medium-, or low-income group) because they do not consider the implicit order among the categories. Other forms of relationships, such as the subclass–superclass relationships among categories (e.g., humans and apes are primates, which in turn is a subclass of mammals), are also ignored. The remainder of this chapter focuses only on binary or nominal class labels.

4.2 General Approach to Solving a Classification Problem

A classification technique (or classifier) is a systematic approach to building classification models from an input data set. Examples include decision tree classifiers, rule-based classifiers, neural networks, support vector machines, and naïve Bayes classifiers. Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label of the input data. The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before. Therefore, a key objective of the learning algorithm is to build models with good generalization capability, i.e., models that accurately predict the class labels of previously unknown records. Figure 4.3 shows a general approach for solving classification problems.
First, a training set consisting of records whose class labels are known must be provided. The training set is used to build a classification model, which is subsequently applied to the test set, which consists of records with unknown class labels.

[Figure 4.3. General approach for building a classification model: a learning algorithm performs induction on a training set of labeled records to learn a model, which is then applied (deduction) to a test set of records whose class labels are unknown.]

Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. These counts are tabulated in a table known as a confusion matrix. Table 4.2 depicts the confusion matrix for a binary classification problem.

Table 4.2. Confusion matrix for a 2-class problem.
                    Predicted Class = 1    Predicted Class = 0
Actual Class = 1    f_11                   f_10
Actual Class = 0    f_01                   f_00

Each entry f_ij in this table denotes the number of records from class i predicted to be of class j. For instance, f_01 is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f_11 + f_00) and the total number of incorrect predictions is (f_10 + f_01).

Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number makes it more convenient to compare the performance of different models. This can be done using a performance metric such as accuracy, which is defined as follows:

    Accuracy = Number of correct predictions / Total number of predictions
             = (f_11 + f_00) / (f_11 + f_10 + f_01 + f_00).   (4.1)

Equivalently, the performance of a model can be expressed in terms of its error rate, which is given by the following equation:

    Error rate = Number of wrong predictions / Total number of predictions
               = (f_10 + f_01) / (f_11 + f_10 + f_01 + f_00).   (4.2)

Most classification algorithms seek models that attain the highest accuracy, or equivalently, the lowest error rate, when applied to the test set. We will revisit the topic of model evaluation in Section 4.5.
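As a rough illustration of Equations 4.1 and 4.2, both metrics can be computed directly from the four counts of a binary confusion matrix. The sketch below is not from the text, and the counts used in it are hypothetical.

```python
def accuracy(f11, f10, f01, f00):
    """Equation 4.1: fraction of test records that are correctly predicted."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    """Equation 4.2: fraction of test records that are wrongly predicted."""
    return (f10 + f01) / (f11 + f10 + f01 + f00)

# Hypothetical confusion matrix: 40 class-1 and 45 class-0 records predicted
# correctly, 5 class-1 records predicted as class 0, and 10 class-0 records
# predicted as class 1.
print(accuracy(40, 5, 10, 45))    # 0.85
print(error_rate(40, 5, 10, 45))  # 0.15
```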
4.3 Decision Tree Induction

This section introduces a decision tree classifier, which is a simple yet widely used classification technique.

4.3.1 How a Decision Tree Works

To illustrate how classification with a decision tree works, consider a simpler version of the vertebrate classification problem described in the previous section. Instead of classifying the vertebrates into five distinct groups of species, we assign them to two categories: mammals and non-mammals.

Suppose a new species is discovered by scientists. How can we tell whether it is a mammal or a non-mammal? One approach is to pose a series of questions about the characteristics of the species. The first question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a follow-up question: Do the females of the species give birth to their young? Those that do give birth are definitely mammals, while those that do not are likely to be non-mammals (with the exception of egg-laying mammals such as the platypus and spiny anteater).

The previous example illustrates how we can solve a classification problem by asking a series of carefully crafted questions about the attributes of the test record. Each time we receive an answer, a follow-up question is asked until we reach a conclusion about the class label of the record. The series of questions and their possible answers can be organized in the form of a decision tree, which is a hierarchical structure consisting of nodes and directed edges. Figure 4.4 shows the decision tree for the mammal classification problem. The tree has three types of nodes:

• A root node that has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.

In a decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions to separate records that have different characteristics. For example, the root node shown in Figure 4.4 uses the attribute Body Temperature to separate warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-mammals is created as the right child of the root node. If the vertebrate is warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other warm-blooded creatures, which are mostly birds.

[Figure 4.4. A decision tree for the mammal classification problem. The root node tests Body Temperature (cold leads to the Non-mammals leaf); the warm branch leads to an internal node that tests Gives Birth (yes leads to Mammals, no leads to Non-mammals).]

Classifying a test record is straightforward once a decision tree has been constructed. Starting from the root node, we apply the test condition to the record and follow the appropriate branch based on the outcome of the test. This leads us either to another internal node, for which a new test condition is applied, or to a leaf node. The class label associated with the leaf node is then assigned to the record. As an illustration, Figure 4.5 traces the path in the decision tree that is used to predict the class label of a flamingo. The path terminates at a leaf node labeled Non-mammals.

[Figure 4.5. Classifying an unlabeled vertebrate (a flamingo: body temperature warm, gives birth no). The dashed lines represent the outcomes of applying the attribute test conditions to the unlabeled vertebrate, which is eventually assigned to the Non-mammals class.]
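This traversal is easy to express in code. The sketch below hard-codes the tree of Figure 4.4 as nested dictionaries (a minimal representation chosen for illustration; the attribute names follow the figure) and classifies the flamingo record from Figure 4.5.

```python
# The decision tree of Figure 4.4, written as nested dictionaries.
# An internal node stores its test attribute and one branch per outcome;
# a leaf node is simply a class label string.
tree = {
    "attribute": "Body Temperature",
    "branches": {
        "Cold": "Non-mammals",
        "Warm": {
            "attribute": "Gives Birth",
            "branches": {"Yes": "Mammals", "No": "Non-mammals"},
        },
    },
}

def classify(record, node):
    """Follow the attribute test conditions from the root to a leaf and
    return the class label stored at that leaf."""
    while isinstance(node, dict):          # internal node: apply its test condition
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node                            # leaf node: the class label

flamingo = {"Body Temperature": "Warm", "Gives Birth": "No"}
print(classify(flamingo, tree))            # Non-mammals
```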
4.3.2 How to Build a Decision Tree

In principle, there are exponentially many decision trees that can be constructed from a given set of attributes. While some of the trees are more accurate than others, finding the optimal tree is computationally infeasible because of the exponential size of the search space. Nevertheless, efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal, decision tree in a reasonable amount of time. These algorithms usually employ a greedy strategy that grows a decision tree by making a series of locally optimum decisions about which attribute to use for partitioning the data. One such algorithm is Hunt's algorithm, which is the basis of many existing decision tree induction algorithms, including ID3, C4.5, and CART. This section presents a high-level discussion of Hunt's algorithm and illustrates some of its design issues.

Hunt's Algorithm

In Hunt's algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets. Let D_t be the set of training records that are associated with node t and y = {y_1, y_2, ..., y_c} be the class labels. The following is a recursive definition of Hunt's algorithm.

Step 1: If all the records in D_t belong to the same class y_t, then t is a leaf node labeled as y_t.

Step 2: If D_t contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in D_t are distributed to the children based on the outcomes. The algorithm is then recursively applied to each child node.

Figure 4.6. Training set for predicting borrowers who will default on loan payments.
Tid  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1    Yes         Single          125K           No
2    No          Married         100K           No
3    No          Single          70K            No
4    Yes         Married         120K           No
5    No          Divorced        95K            Yes
6    No          Married         60K            No
7    Yes         Divorced        220K           No
8    No          Single          85K            Yes
9    No          Married         75K            No
10   No          Single          90K            Yes
(Home Owner is a binary attribute, Marital Status is categorical, Annual Income is continuous, and Defaulted Borrower is the class label.)

To illustrate how the algorithm works, consider the problem of predicting whether a loan applicant will repay her loan obligations or become delinquent, subsequently defaulting on her loan. A training set for this problem can be constructed by examining the records of previous borrowers. In the example shown in Figure 4.6, each record contains the personal information of a borrower along with a class label indicating whether the borrower has defaulted on loan payments.

The initial tree for the classification problem contains a single node with class label Defaulted = No (see Figure 4.7(a)), which means that most of the borrowers successfully repaid their loans. The tree, however, needs to be refined since the root node contains records from both classes. The records are subsequently divided into smaller subsets based on the outcomes of the Home Owner test condition, as shown in Figure 4.7(b). The justification for choosing this attribute test condition will be discussed later. For now, we will assume that this is the best criterion for splitting the data at this point. Hunt's algorithm is then applied recursively to each child of the root node. From the training set given in Figure 4.6, notice that all borrowers who are home owners successfully repaid their loans. The left child of the root is therefore a leaf node labeled Defaulted = No (see Figure 4.7(b)). For the right child, we need to continue applying the recursive step of Hunt's algorithm until all the records belong to the same class. The trees resulting from each recursive step are shown in Figures 4.7(c) and (d).

[Figure 4.7. Hunt's algorithm for inducing decision trees: (a) the initial tree, a single leaf labeled Defaulted = No; (b) a split on Home Owner (Yes leads to a leaf labeled Defaulted = No); (c) a further split of the Home Owner = No branch on Marital Status; (d) a final split of the Single/Divorced branch on Annual Income (< 80K: Defaulted = No, >= 80K: Defaulted = Yes).]
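To make the first recursive step concrete, the short sketch below encodes the training records of Figure 4.6 and applies the Home Owner test condition: the left child is pure and becomes a leaf labeled Defaulted = No, while the right child still contains both classes and must be partitioned further. This is an illustrative sketch; the tuple encoding of the records is not part of the original text.

```python
# Training records from Figure 4.6:
# (Home Owner, Marital Status, Annual Income in thousands, Defaulted Borrower)
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

def is_pure(subset):
    """Step 1 of Hunt's algorithm: a node becomes a leaf as soon as all of
    its records carry the same class label."""
    return len({r[-1] for r in subset}) == 1

# Step 2: partition the records on the Home Owner test condition.
home_owner = [r for r in records if r[0] == "Yes"]
not_owner  = [r for r in records if r[0] == "No"]

print(is_pure(home_owner))   # True  -> leaf node labeled Defaulted = No
print(is_pure(not_owner))    # False -> apply Hunt's algorithm recursively
```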
Hunt's algorithm will work if every combination of attribute values is present in the training data and each combination has a unique class label. These assumptions are too stringent for use in most practical situations. Additional conditions are needed to handle the following cases:

1. It is possible for some of the child nodes created in Step 2 to be empty; i.e., there are no records associated with these nodes. This can happen if none of the training records have the combination of attribute values associated with such nodes. In this case the node is declared a leaf node with the same class label as the majority class of training records associated with its parent node.

2. In Step 2, if all the records associated with D_t have identical attribute values (except for the class label), then it is not possible to split these records any further. In this case, the node is declared a leaf node with the same class label as the majority class of training records associated with this node.

Design Issues of Decision Tree Induction

A learning algorithm for inducing decision trees must address the following two issues.

1. How should the training records be split? Each recursive step of the tree-growing process must select an attribute test condition to divide the records into smaller subsets. To implement this step, the algorithm must provide a method for specifying the test condition for different attribute types as well as an objective measure for evaluating the goodness of each test condition.

2. How should the splitting procedure stop? A stopping condition is needed to terminate the tree-growing process. A possible strategy is to continue expanding a node until either all the records belong to the same class or all the records have identical attribute values. Although both conditions are sufficient to stop any decision tree induction algorithm, other criteria can be imposed to allow the tree-growing procedure to terminate earlier. The advantages of early termination will be discussed later in Section 4.4.5.

4.3.3 Methods for Expressing Attribute Test Conditions

Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types.

Binary Attributes

The test condition for a binary attribute generates two potential outcomes, as shown in Figure 4.8.

[Figure 4.8. Test condition for binary attributes: Body Temperature, with the two outcomes warm-blooded and cold-blooded.]

Nominal Attributes

Since a nominal attribute can have many values, its test condition can be expressed in two ways, as shown in Figure 4.9. For a multiway split (Figure 4.9(a)), the number of outcomes depends on the number of distinct values for the corresponding attribute. For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split. On the other hand, some decision tree algorithms, such as CART, produce only binary splits by considering all 2^(k-1) - 1 ways of creating a binary partition of k attribute values. Figure 4.9(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.

[Figure 4.9. Test conditions for nominal attributes: (a) a multiway split of Marital Status into Single, Married, and Divorced; (b) binary splits obtained by grouping attribute values: {Married} vs. {Single, Divorced}, {Single} vs. {Married, Divorced}, and {Single, Married} vs. {Divorced}.]
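The 2^(k-1) - 1 count of binary partitions can be reproduced with a short enumeration. The sketch below lists the three groupings of the marital status values shown in Figure 4.9(b); the helper name binary_partitions is ours, not the book's.

```python
from itertools import combinations

def binary_partitions(values):
    """Enumerate the 2^(k-1) - 1 ways of splitting k attribute values into
    two non-empty groups. Fixing the first value on the left-hand side
    ensures that each unordered partition is generated exactly once."""
    first, rest = values[0], values[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(values) - left
            if right:                      # skip the split whose right group is empty
                yield left, right

for left, right in binary_partitions(["Single", "Married", "Divorced"]):
    print(sorted(left), "vs", sorted(right))
# ['Single'] vs ['Divorced', 'Married'], and so on: three groupings in total,
# matching Figure 4.9(b).
```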
Ordinal Attributes

Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values. Figure 4.10 illustrates various ways of splitting training records based on the Shirt Size attribute. The groupings shown in Figures 4.10(a) and (b) preserve the order among the attribute values, whereas the grouping shown in Figure 4.10(c) violates this property because it combines the attribute values Small and Large into the same partition while Medium and Extra Large are combined into another partition.

[Figure 4.10. Different ways of grouping the ordinal attribute values of Shirt Size: (a) {Small, Medium} vs. {Large, Extra Large}; (b) {Small} vs. {Medium, Large, Extra Large}; (c) {Small, Large} vs. {Medium, Extra Large}.]

Continuous Attributes

For continuous attributes, the test condition can be expressed as a comparison test (A < v) or (A >= v) with binary outcomes, or as a range query with outcomes of the form v_i <= A < v_{i+1}, for i = 1, ..., k. The difference between these approaches is shown in Figure 4.11. For the binary case, the decision tree algorithm must consider all possible split positions v and select the one that produces the best partition. For the multiway split, the algorithm must consider all possible ranges of continuous values. One approach is to apply the discretization strategies described in Section 2.3.6 on page 57. After discretization, a new ordinal value will be assigned to each discretized interval. Adjacent intervals can also be aggregated into wider ranges as long as the order property is preserved.

[Figure 4.11. Test condition for continuous attributes: (a) a binary split on Annual Income (at most 80K vs. more than 80K); (b) a multiway split into income ranges such as < 10K, [10K, 25K), [25K, 50K), [50K, 80K), and >= 80K.]

[Figure 4.12. Multiway versus binary splits: (a) Gender: Male (C0: 6, C1: 4), Female (C0: 4, C1: 6); (b) Car Type: Family (C0: 1, C1: 3), Sports (C0: 8, C1: 0), Luxury (C0: 1, C1: 7); (c) Customer ID: v1 through v20, each node containing a single record.]

4.3.4 Measures for Selecting the Best Split

There are many measures that can be used to determine the best way to split the records. These measures are defined in terms of the class distribution of the records before and after splitting.

Let p(i|t) denote the fraction of records belonging to class i at a given node t. We sometimes omit the reference to node t and express the fraction as p_i. In a two-class problem, the class distribution at any node can be written as (p_0, p_1), where p_1 = 1 - p_0. To illustrate, consider the test conditions shown in Figure 4.12. The class distribution before splitting is (0.5, 0.5) because there are an equal number of records from each class. If we split the data using the Gender attribute, then the class distributions of the child nodes are (0.6, 0.4) and (0.4, 0.6), respectively. Although the classes are no longer evenly distributed, the child nodes still contain records from both classes. Splitting on the second attribute, Car Type, will result in purer partitions.

The measures developed for selecting the best split are often based on the degree of impurity of the child nodes. The smaller the degree of impurity, the more skewed the class distribution.
For example, a node with class distribution (0, 1) has zero impurity, whereas a node with a uniform class distribution (0.5, 0.5) has the highest impurity. Examples of impurity measures include

    Entropy(t) = - Σ_{i=0}^{c-1} p(i|t) log_2 p(i|t),          (4.3)
    Gini(t) = 1 - Σ_{i=0}^{c-1} [p(i|t)]^2,                    (4.4)
    Classification error(t) = 1 - max_i [p(i|t)],              (4.5)

where c is the number of classes and 0 log_2 0 = 0 in entropy calculations.

[Figure 4.13. Comparison among the impurity measures (entropy, Gini, and misclassification error) for binary classification problems, plotted against the fraction p of records in one of the two classes.]

Figure 4.13 compares the values of the impurity measures for binary classification problems. p refers to the fraction of records that belong to one of the two classes. Observe that all three measures attain their maximum value when the class distribution is uniform (i.e., when p = 0.5). The minimum values for the measures are attained when all the records belong to the same class (i.e., when p equals 0 or 1). We next provide several examples of computing the different impurity measures.

Node N1 (Class=0 count: 0, Class=1 count: 6):
    Gini = 1 - (0/6)^2 - (6/6)^2 = 0
    Entropy = -(0/6) log_2 (0/6) - (6/6) log_2 (6/6) = 0
    Error = 1 - max[0/6, 6/6] = 0

Node N2 (Class=0 count: 1, Class=1 count: 5):
    Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
    Entropy = -(1/6) log_2 (1/6) - (5/6) log_2 (5/6) = 0.650
    Error = 1 - max[1/6, 5/6] = 0.167

Node N3 (Class=0 count: 3, Class=1 count: 3):
    Gini = 1 - (3/6)^2 - (3/6)^2 = 0.5
    Entropy = -(3/6) log_2 (3/6) - (3/6) log_2 (3/6) = 1
    Error = 1 - max[3/6, 3/6] = 0.5

The preceding examples, along with Figure 4.13, illustrate the consistency among the different impurity measures. Based on these calculations, node N1 has the lowest impurity value, followed by N2 and N3. Despite their consistency, the attribute chosen as the test condition may vary depending on the choice of impurity measure, as will be shown in Exercise 3 on page 198.

To determine how well a test condition performs, we need to compare the degree of impurity of the parent node (before splitting) with the degree of impurity of the child nodes (after splitting). The larger their difference, the better the test condition. The gain, Δ, is a criterion that can be used to determine the goodness of a split:

    Δ = I(parent) - Σ_{j=1}^{k} [N(v_j)/N] I(v_j),             (4.6)

where I(·) is the impurity measure of a given node, N is the total number of records at the parent node, k is the number of attribute values, and N(v_j) is the number of records associated with the child node v_j. Decision tree induction algorithms often choose a test condition that maximizes the gain Δ. Since I(parent) is the same for all test conditions, maximizing the gain is equivalent to minimizing the weighted average impurity of the child nodes. Finally, when entropy is used as the impurity measure in Equation 4.6, the difference in entropy is known as the information gain, Δ_info.

Splitting of Binary Attributes

Consider the diagram shown in Figure 4.14. Suppose there are two ways to split the data into smaller subsets. Before splitting, the Gini index is 0.5 since there are an equal number of records from both classes. If attribute A is chosen to split the data, the Gini index for node N1 is 0.4898 and for node N2 it is 0.480. The weighted average of the Gini index for the descendent nodes is (7/12) × 0.4898 + (5/12) × 0.480 = 0.486. Similarly, we can show that the weighted average of the Gini index for attribute B is 0.375. Since the subsets for attribute B have a smaller Gini index, it is preferred over attribute A.
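The impurity measures and the gain of Equation 4.6 are straightforward to compute from class counts. The sketch below (the function names are ours) reproduces the three example nodes N1, N2, and N3 and compares the gains of attributes A and B from Figure 4.14.

```python
from math import log2

def entropy(counts):
    """Equation 4.3, with the convention 0 * log2(0) = 0."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Equation 4.4."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def classification_error(counts):
    """Equation 4.5."""
    n = sum(counts)
    return 1.0 - max(counts) / n

def gain(parent, children, impurity=gini):
    """Equation 4.6: parent impurity minus the weighted average impurity
    of the child nodes; counts are (class 0, class 1) pairs."""
    n = sum(parent)
    return impurity(parent) - sum(sum(c) / n * impurity(c) for c in children)

# The three example nodes from the text:
for name, counts in [("N1", (0, 6)), ("N2", (1, 5)), ("N3", (3, 3))]:
    print(name, round(gini(counts), 3), round(entropy(counts), 3),
          round(classification_error(counts), 3))
# N1 0.0 0.0 0.0    N2 0.278 0.65 0.167    N3 0.5 1.0 0.5

# Attributes A and B from Figure 4.14 (the parent has 6 records of each class):
print(round(gain((6, 6), [(4, 3), (2, 3)]), 3))  # attribute A: ~0.014 (weighted Gini ~0.486)
print(round(gain((6, 6), [(1, 4), (5, 2)]), 3))  # attribute B: ~0.129, the larger gain, so B is preferred
```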
[Figure 4.14. Splitting binary attributes. The parent node has 6 records from class C0 and 6 from class C1 (Gini = 0.500). Attribute A sends (C0 = 4, C1 = 3) to node N1 and (C0 = 2, C1 = 3) to node N2, giving a weighted Gini of 0.486; attribute B sends (C0 = 1, C1 = 4) to N1 and (C0 = 5, C1 = 2) to N2, giving a weighted Gini of 0.375.]

Splitting of Nominal Attributes

As previously noted, a nominal attribute can produce either binary or multiway splits, as shown in Figure 4.15. The computation of the Gini index for a binary split is similar to that shown for binary attributes. For the first binary grouping of the Car Type attribute, the Gini index of {Sports, Luxury} is 0.4922 and the Gini index of {Family} is 0.3750. The weighted average Gini index for the grouping is equal to 16/20 × 0.4922 + 4/20 × 0.3750 = 0.468. Similarly, for the second binary grouping, {Sports} and {Family, Luxury}, the weighted average Gini index is 0.167. The second grouping has a lower Gini index because its corresponding subsets are much purer.

[Figure 4.15. Splitting nominal attributes. (a) Binary splits of Car Type: {Sports, Luxury} (C0: 9, C1: 7) vs. {Family} (C0: 1, C1: 3), Gini = 0.468; {Sports} (C0: 8, C1: 0) vs. {Family, Luxury} (C0: 2, C1: 10), Gini = 0.167. (b) Multiway split: Family (C0: 1, C1: 3), Sports (C0: 8, C1: 0), Luxury (C0: 1, C1: 7), Gini = 0.163.]

For the multiway split, the Gini index is computed for every attribute value. Since Gini(Family) = 0.375, Gini(Sports) = 0, and Gini(Luxury) = 0.219, the overall Gini index for the multiway split is equal to 4/20 × 0.375 + 8/20 × 0 + 8/20 × 0.219 = 0.163. The multiway split has a smaller Gini index compared to both two-way splits. This result is not surprising because the two-way split actually merges some of the outcomes of a multiway split and thus results in less pure subsets.
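As a quick check on the arithmetic, the weighted Gini index of each Car Type grouping can be recomputed from the class counts shown in Figure 4.15. This is a small sketch; weighted_gini is our helper name.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(groups):
    """Weighted average Gini index over the child nodes of a split,
    where each group is a (C0 count, C1 count) pair."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

# Class counts taken from Figure 4.15:
print(weighted_gini([(9, 7), (1, 3)]))           # {Sports, Luxury} vs {Family}: 0.46875 (0.468 above)
print(weighted_gini([(8, 0), (2, 10)]))          # {Sports} vs {Family, Luxury}: ~0.167
print(weighted_gini([(1, 3), (8, 0), (1, 7)]))   # multiway split: 0.1625 (0.163 above)
```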
Splitting of Continuous Attributes

Consider the example shown in Figure 4.16, in which the test condition Annual Income <= v is used to split the training records for the loan default classification problem. A brute-force method for finding v is to consider every value of the attribute in the N records as a candidate split position. For each candidate v, the data set is scanned once to count the number of records with annual income less than or greater than v. We then compute the Gini index for each candidate and choose the one that gives the lowest value. This approach is computationally expensive because it requires O(N) operations to compute the Gini index at each candidate split position. Since there are N candidates, the overall complexity of this task is O(N^2). To reduce the complexity, the training records are sorted based on their annual income, a computation that requires O(N log N) time. Candidate split positions are identified by taking the midpoints between two adjacent sorted values: 55, 65, 72, and so on. However, unlike the brute-force approach, we do not have to examine all N records when evaluating the Gini index of a candidate split position.

[Figure 4.16. Splitting continuous attributes. The ten training records, sorted by Annual Income: 60, 70, 75, 85, 90, 95, 100, 120, 125, 220 (classes No, No, No, Yes, Yes, Yes, No, No, No, No). The candidate split positions are 55, 65, 72, 80, 87, 92, 97, 110, 122, 172, and 230, with corresponding Gini index values 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, and 0.420.]

For the first candidate, v = 55, none of the records has annual income less than 55K. As a result, the Gini index for the descendent node with Annual Income < 55K is zero. On the other hand, the number of records with annual income greater than or equal to 55K is 3 (for class Yes) and 7 (for class No), respectively. Thus, the Gini index for this node is 0.420. The overall Gini index for this candidate split position is equal to 0 × 0 + 1 × 0.420 = 0.420.

For the second candidate, v = 65, we can determine its class distribution by updating the distribution of the previous candidate. More specifically, the new distribution is obtained by examining the class label of the record with the lowest annual income (i.e., 60K). Since the class label for this record is No, the count for class No is increased from 0 to 1 (for Annual Income <= 65K) and is decreased from 7 to 6 (for Annual Income > 65K). The distribution for class Yes remains unchanged. The new weighted-average Gini index for this candidate split position is 0.400.

This procedure is repeated until the Gini index values for all candidates are computed, as shown in Figure 4.16. The best split position corresponds to the one that produces the smallest Gini index, i.e., v = 97. This procedure is less expensive because it requires a constant amount of time to update the class distribution at each candidate split position. It can be further optimized by considering only candidate split positions located between two adjacent records with different class labels. For example, because the first three sorted records (with annual incomes 60K, 70K, and 75K) have identical class labels, the best split position should not reside between 60K and 75K. Therefore, the candidate split positions at v = 55K, 65K, 72K, 87K, 92K, 110K, 122K, 172K, and 230K are ignored because they are located between two adjacent records with the same class labels. This approach allows us to reduce the number of candidate split positions from 11 to 2.
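The sorted-scan procedure can be sketched in a few lines of code. The example below uses the Annual Income values and class labels from Figure 4.6, updates the two class distributions in constant time per record, and skips candidates that lie between records with the same class label, so only two candidate positions are actually evaluated. This is a rough sketch; it reports the exact midpoint 97.5 where the text uses the rounded value 97.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

# (Annual Income, Defaulted) pairs from the loan data of Figure 4.6.
records = sorted([(125, "No"), (100, "No"), (70, "No"), (120, "No"), (95, "Yes"),
                  (60, "No"), (220, "No"), (85, "Yes"), (75, "No"), (90, "Yes")])
labels = ["Yes", "No"]
n = len(records)

below = {c: 0 for c in labels}                              # counts for Annual Income <= v
above = {c: sum(1 for _, y in records if y == c) for c in labels}
best_v, best_gini = None, float("inf")

for i, (income, label) in enumerate(records[:-1]):
    below[label] += 1                                       # constant-time update of the distributions
    above[label] -= 1
    if label == records[i + 1][1]:
        continue                                            # same class on both sides: skip this candidate
    v = (income + records[i + 1][0]) / 2                    # midpoint between adjacent sorted values
    w = (i + 1) / n * gini([below[c] for c in labels]) \
        + (n - i - 1) / n * gini([above[c] for c in labels])
    if w < best_gini:
        best_v, best_gini = v, w

print(best_v, best_gini)                                    # 97.5 0.3  (v = 97 in Figure 4.16)
```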
Gain Ratio

Impurity measures such as entropy and the Gini index tend to favor attributes that have a large number of distinct values. Figure 4.12 shows three alternative test conditions for partitioning the data set given in Exercise 2 on page 198. Comparing the first test condition, Gender, with the second, Car Type, it is easy to see that Car Type seems to provide a better way of splitting the data since it produces purer descendent nodes. However, if we compare both conditions with Customer ID, the latter appears to produce purer partitions. Yet Customer ID is not a predictive attribute because its value is unique for each record. Even in a less extreme situation, a test condition that results in a large number of outcomes may not be desirable because the number of records associated with each partition is too small to enable us to make any reliable predictions.

There are two strategies for overcoming this problem. The first strategy is to restrict the test conditions to binary splits only. This strategy is employed by decision tree algorithms such as CART. Another strategy is to modify the splitting criterion to take into account the number of outcomes produced by the attribute test condition. For example, in the C4.5 decision tree algorithm, a splitting criterion known as gain ratio is used to determine the goodness of a split. This criterion is defined as

    Gain ratio = Δ_info / Split Info,   (4.7)

where Split Info = - Σ_{i=1}^{k} P(v_i) log_2 P(v_i) and k is the total number of splits. For example, if each attribute value has the same number of records, then P(v_i) = 1/k for all i and the split information is equal to log_2 k. This example suggests that if an attribute produces a large number of splits, its split information will also be large, which in turn reduces its gain ratio.

4.3.5 Algorithm for Decision Tree Induction

A skeleton decision tree induction algorithm called TreeGrowth is shown in Algorithm 4.1. The input to this algorithm consists of the training records E and the attribute set F. The algorithm works by recursively selecting the best attribute to split the data (Step 7) and expanding the leaf nodes of the tree until the stopping criterion is met.

Algorithm 4.1 A skeleton decision tree induction algorithm.
TreeGrowth(E, F)
 1: if stopping_cond(E, F) = true then
 2:   leaf = createNode().
 3:   leaf.label = Classify(E).
 4:   return leaf.
 5: else
 6:   root = createNode().
 7:   root.test_cond = find_best_split(E, F).
 8:   let V = {v | v is a possible outcome of root.test_cond}.
 9:   for each v ∈ V do
10:     E_v = {e | root.test_cond(e) = v and e ∈ E}.
11:     child = TreeGrowth(E_v, F).
12:     add child as descendent of root and label the edge (root → child) as v.
13:   end for
14: end if
15: return root.
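A direct transliteration of Algorithm 4.1 into Python might look like the sketch below; the three helper routines are passed in as functions, mirroring stopping_cond(), find_best_split(), and Classify() in the pseudocode. This is an illustrative skeleton under those assumptions, not code from the text.

```python
def tree_growth(E, F, stopping_cond, find_best_split, classify):
    """Sketch of Algorithm 4.1 (TreeGrowth). E is the list of training
    records and F the attribute set. find_best_split is assumed to return
    a test condition, i.e. a function mapping a record to one of a small
    set of outcomes; classify returns the class label for a leaf node."""
    if stopping_cond(E, F):                          # Steps 1-4: create a leaf node
        return {"label": classify(E)}
    test_cond = find_best_split(E, F)                # Step 7: choose the best split
    outcomes = {test_cond(e) for e in E}             # Step 8: the set V of possible outcomes
    children = {}
    for v in outcomes:                               # Steps 9-13: grow one child per outcome
        E_v = [e for e in E if test_cond(e) == v]    # Step 10: records routed to outcome v
        children[v] = tree_growth(E_v, F, stopping_cond, find_best_split, classify)
    return {"test_cond": test_cond, "children": children}
```

In practice, stopping_cond would typically test whether all records in E share a class label or have identical attribute values, and find_best_split would evaluate candidate test conditions using one of the impurity-based criteria discussed in Section 4.3.4.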
