Data mining classification Decision tree

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation and data mining classification methods
Dr.JakeFinlay Profile Pic
Dr.JakeFinlay,Germany,Teacher
Published Date:22-07-2017
Your Website URL(Optional)
Comment
Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1Classification: Definition  Given a collection of records (training set ) – Each record contains a set of attributes, one of the attributes is the class.  Find a model for class attribute as a function of the values of other attributes.  Goal: previously unseen records should be assigned a class as accurately as possible. – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Illustrating Classification Task Tid Attrib1 Attrib2 Attrib3 Class Learning No 1 Yes Large 125K algorithm 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No Induction 5 No Large 95K Yes No 6 No Medium 60K Learn 7 Yes Large 220K No 8 No Small 85K Yes Model 9 No Medium 75K No 10 No Small 90K Yes 10 Model Training Set Apply Model Attrib1 Attrib2 Attrib3 Class Tid 11 No Small 55K ? 12 Yes Medium 80K ? Deduction 13 Yes Large 110K ? 14 No Small 95K ? 15 No Large 67K ? 10 Test Set © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Examples of Classification Task  Predicting tumor cells as benign or malignant  Classifying credit card transactions as legitimate or fraudulent  Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil  Categorizing news stories as finance, weather, entertainment, sports, etc © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Classification Techniques  Decision Tree based Methods  Rule-based Methods  Memory based reasoning  Neural Networks  Naïve Bayes and Bayesian Belief Networks  Support Vector Machines © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Example of a Decision Tree Splitting Attributes Tid Refund Marital Taxable Cheat Status Income 1 Yes Single 125K No Refund 2 No Married 100K No Yes No 3 No Single 70K No 4 Yes Married 120K No NO MarSt 5 No Divorced 95K Yes Married Single, Divorced 6 No Married 60K No TaxInc NO No 7 Yes Divorced 220K 80K 80K Yes 8 No Single 85K 9 No Married 75K No NO YES 10 No Single 90K Yes 10 Model: Decision Tree Training Data © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Another Example of Decision Tree Single, MarSt Married Divorced Tid Refund Marital Taxable Cheat Status Income NO Refund 1 Yes Single 125K No No Yes No 2 No Married 100K 3 No Single 70K No NO TaxInc 4 Yes Married 120K No 80K 80K 5 No Divorced 95K Yes YES NO 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes No There could be more than one tree that 9 No Married 75K fits the same data 10 No Single 90K Yes 10 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Decision Tree Classification Task Tree Tid Attrib1 Attrib2 Attrib3 Class Induction No 1 Yes Large 125K algorithm 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No Induction 5 No Large 95K Yes No 6 No Medium 60K Learn 7 Yes Large 220K No 8 No Small 85K Yes Model 9 No Medium 75K No 10 No Small 90K Yes 10 Model Training Set Apply Decision Model Tree Tid Attrib1 Attrib2 Attrib3 Class 11 No Small 55K ? 12 Yes Medium 80K ? Deduction 13 Yes Large 110K ? 14 No Small 95K ? ? 15 No Large 67K 10 Test Set © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Apply Model to Test Data Test Data Start from the root of tree. Refund Marital Taxable Cheat Status Income No Married 80K ? 10 Refund Yes No NO MarSt Married Single, Divorced TaxInc NO 80K 80K NO YES © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Apply Model to Test Data Test Data Refund Marital Taxable Cheat Status Income No Married 80K ? 10 Refund Yes No NO MarSt Married Single, Divorced TaxInc NO 80K 80K NO YES © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Apply Model to Test Data Test Data Refund Marital Taxable Cheat Status Income No Married 80K ? 10 Refund Yes No NO MarSt Married Single, Divorced TaxInc NO 80K 80K NO YES © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Apply Model to Test Data Test Data Refund Marital Taxable Cheat Status Income No Married 80K ? 10 Refund Yes No NO MarSt Married Single, Divorced TaxInc NO 80K 80K NO YES © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Apply Model to Test Data Test Data Refund Marital Taxable Cheat Status Income No Married 80K ? 10 Refund Yes No NO MarSt Married Single, Divorced TaxInc NO 80K 80K NO YES © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Apply Model to Test Data Test Data Refund Marital Taxable Cheat Status Income No Married 80K ? 10 Refund Yes No NO MarSt Assign Cheat to “No” Married Single, Divorced TaxInc NO 80K 80K NO YES © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Decision Tree Classification Task Tree Tid Attrib1 Attrib2 Attrib3 Class Induction 1 Yes Large 125K No algorithm 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No Induction 5 No Large 95K Yes 6 No Medium 60K No Learn 7 Yes Large 220K No Yes 8 No Small 85K Model 9 No Medium 75K No 10 No Small 90K Yes 10 Model Training Set Apply Decision Model Tree Tid Attrib1 Attrib2 Attrib3 Class 11 No Small 55K ? ? 12 Yes Medium 80K Deduction 13 Yes Large 110K ? 14 No Small 95K ? 15 No Large 67K ? 10 Test Set © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Decision Tree Induction  Many Algorithms: – Hunt’s Algorithm (one of the earliest) – CART – ID3, C4.5 – SLIQ,SPRINT © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›General Structure of Hunt’s Algorithm Tid Refund Marital Taxable  Let D be the set of training records Cheat Status Income t that reach a node t 1 Yes Single 125K No 2 No Married 100K No  General Procedure: 3 No Single 70K No – If D contains records that t 4 Yes Married 120K No belong the same class y , then t t 5 No Divorced 95K Yes is a leaf node labeled as y t 6 No Married 60K No – If D is an empty set, then t is a 7 Yes Divorced 220K No t leaf node labeled by the default 8 No Single 85K Yes 9 No Married 75K No class, y d 10 No Single 90K Yes 10 – If D contains records that t belong to more than one class, D t use an attribute test to split the data into smaller subsets. Recursively apply the ? procedure to each subset. © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Tid Refund Marital Taxable Cheat Hunt’s Algorithm Status Income 1 Yes Single 125K No 2 No Married 100K No Refund 3 No Single 70K No Don’t Yes No Cheat 4 Yes Married 120K No Don’t Don’t 5 No Divorced 95K Yes Cheat Cheat 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes Refund 9 No Married 75K No Refund Yes No 10 No Single 90K Yes No Yes 10 Don’t Marital Don’t Marital Cheat Cheat Status Status Single, Single, Married Married Divorced Divorced Don’t Don’t Taxable Cheat Cheat Cheat Income 80K = 80K Don’t Cheat Cheat © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Tree Induction  Greedy strategy. – Split the records based on an attribute test that optimizes certain criterion.  Issues – Determine how to split the records How to specify the attribute test condition? How to determine the best split? – Determine when to stop splitting © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›Tree Induction  Greedy strategy. – Split the records based on an attribute test that optimizes certain criterion.  Issues – Determine how to split the records How to specify the attribute test condition? How to determine the best split? – Determine when to stop splitting © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹›