Data mining and warehousing lab manual using Weka

LavinaKanna
Published Date:13-07-2017
St. MARTIN's ENGINEERING COLLEGE
Dhulapally (V), Qutbullapur (M), Secunderabad-500014

COMPUTER SCIENCE AND ENGINEERING

LAB MANUAL OF DATA WAREHOUSE AND DATA MINING
IV B. Tech I semester (JNTUH-R13)

Prepared by
P. CHANDRASHAKER REDDY, Associate Professor
S. SRILAXMI, Assistant Professor
G. SATISH, Assistant Professor
G. PUSPHA RAJITHA, Assistant Professor

St. Martins Engineering College Exp No__ Date____

SNo    Experiment Name    Date
Listing of categorical attributes and the real-valued attributes separately.
Rules for identifying attributes.
Training a decision tree.
Test on classification of decision tree.
Testing on the training set.
Using cross-validation for training.
Significance of attributes in decision tree.
Trying generation of decision tree with various number of decision tree.
Find out differences in results using decision tree and cross-validation on a data set.
Decision trees.
Reduced error pruning for training Decision Trees using cross-validation.
Convert a Decision Tree into "if-then-else rules".

Roll No.___________________ Page 1

WEKA INTRODUCTION

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.

Weka stands for the Waikato Environment for Knowledge Analysis; it was developed at the University of Waikato in New Zealand. WEKA is extensible and has become a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost every platform. WEKA is easy to use and can be applied at several different levels.
You can access the WEKA class library from your own Java program and implement new machine learning algorithms. WEKA provides three major kinds of implemented schemes:
1. Implemented schemes for classification,
2. Implemented schemes for numeric prediction,
3. Implemented "meta-schemes".

Besides actual learning schemes, WEKA also contains a large variety of tools for pre-processing datasets, so that you can focus on your algorithm without worrying about details such as reading data from files, implementing filtering algorithms, and providing code to evaluate the results.

The Weka GUI Chooser provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI ("Multiple Document Interface") appearance, this is provided by an alternative launcher called "Main" (class weka.gui.Main).

The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus. The buttons can be used to start the following applications:
• Explorer: An environment for exploring data with WEKA.
• Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.
• Knowledge Flow: This environment supports essentially the same functions as the Explorer, but with a drag-and-drop interface. One advantage is that it supports incremental learning.
• Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands on operating systems that do not provide their own command-line interface.

Figure: Weka GUI Chooser

Credit Risk Assessment

Description: The business of banks is making loans. Assessing the creditworthiness of an applicant is of crucial importance. You have to develop a system to help a loan officer decide whether the credit of a customer is good or bad.
A bank's business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible; interest on these loans is the bank's profit source. On the other hand, a bank cannot afford to make too many bad loans, since too many bad loans could lead to the collapse of the bank. The bank's loan policy must therefore involve a compromise: not too strict, and not too lenient.

To do the assignment, you first and foremost need some knowledge about the world of credit. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the creditworthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and when not to, approve a loan application.

The German Credit Data:
Actual historical credit data is not always easy to come by because of confidentiality rules. Here is one such dataset: the German credit data, available in its original form and as an Excel spreadsheet version (download from web). In spite of the fact that the data is German, you should probably make use of it for this assignment (unless you really can consult a real loan officer).

A few notes on the German dataset:
• DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian (but looks and acts like a quarter).
• Owns telephone: German phone rates are much higher than in Canada, so fewer people own telephones.
• Foreign worker: There are millions of these in Germany (many from Turkey). It is very hard to get German citizenship if you were not born of German parents.
• There are 20 attributes used in judging a loan applicant. The goal is to classify the applicant into one of two categories, good or bad.

EXPERIMENT-1

List all the categorical (or nominal) attributes and the real-valued attributes separately.

Aim: To list all the categorical (or nominal) attributes and the real-valued attributes using the Weka mining tool.

Tools/ Apparatus: Weka mining tool.

Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER under Applications.
3) Select the Preprocess tab.
4) Click Open file and browse to the file already stored on the system, "German credit data.arff".
5) Click on any attribute in the left panel to show the basic statistics for that attribute.

Sample Output: [figure]

6) Click the Choose button, then select weka > filters > unsupervised > attribute > RemoveType.
7) Click anywhere on the RemoveType text box beside the Choose button to open the weka.gui.GenericObjectEditor window.
8) Set the attribute type to "Delete nominal attributes" and set invertSelection to True.
9) Click OK and then click Apply. Only the nominal attributes are then shown: [figure]
10) Click Undo, then repeat the above steps to display all numeric attributes by setting the attribute type to "Delete numeric attributes".

Output:

Categorical/ Nominal Attributes:
1. Checking_status
2. Credit_history
3. Purpose
4. Savings_status
5. Employment
6. Personal_status
7. Other_parties
8. Property_magnitude
9. Other_payment_plans
10. Housing
11. Job
12. Own_telephone
13. Foreign_worker
14. Class

Numeric Attributes:
1. Duration
2. Credit_amount
3. Installment_commitment
4. Residence_since
5. Age
6. Existing_credits
7. Num_dependents

Result: Hence all categorical and numeric attributes are displayed.

EXPERIMENT-2

What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.

Aim: To identify the rules with some of the important attributes a) manually and b) using Weka.

Tools/ Apparatus: Weka mining tool.

In my opinion, the following attributes may be crucial in making the credit risk assessment:
1. Credit_history
2. Employment
3. Property_magnitude
4. Job
5. Duration
6. Credit_amount
7. Installment_commitment
8. Existing_credits

Based on the above attributes, we can make a decision whether to give credit or not.

Theory:
Association rule mining is defined as follows. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of n binary attributes called items. Let $D = \{t_1, t_2, \ldots, t_m\}$ be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form $X \Rightarrow Y$, where $X, Y \subseteq I$ and $X \cap Y = \emptyset$. The sets of items (for short, item sets) X and Y are called the antecedent (left-hand side or LHS) and the consequent (right-hand side or RHS) of the rule, respectively.

As an example from the supermarket domain, take the set of items $I = \{\text{milk}, \text{bread}, \text{butter}, \text{beer}\}$ and a small database of transactions over these items, written as a table in which 1 codes presence and 0 codes absence of an item in a transaction. An example rule for the supermarket could be $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$, meaning that if milk and bread are bought, customers also buy butter.

To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence. The support $\mathrm{supp}(X)$ of an item set X is defined as the proportion of transactions in the data set which contain the item set.
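The support measure, together with the confidence of a rule (taken here as supp(X ∪ Y) / supp(X)), can be checked with a few lines of Python. This is only a minimal sketch: the five transactions below are an assumption, chosen so that they agree with the figures quoted in this section, and the function names are illustrative rather than part of Weka.

```python
from fractions import Fraction

# Hypothetical five-transaction database over the items
# milk, bread, butter, beer (an assumption, not the manual's table).
TRANSACTIONS = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]


def support(itemset, transactions):
    """Proportion of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(lhs, rhs, transactions):
    """supp(lhs U rhs) / supp(lhs); assumes lhs occurs at least once."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)


supp_mb = support({"milk", "bread"}, TRANSACTIONS)
conf_rule = confidence({"milk", "bread"}, {"butter"}, TRANSACTIONS)

print(float(supp_mb))    # 0.4
print(float(conf_rule))  # 0.5
```

Exact fractions are used so that the ratio of two supports does not pick up floating-point noise before it is printed.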
In the example database, the item set $\{\text{milk}, \text{bread}\}$ has a support of 2 / 5 = 0.4, since it occurs in 40% of all transactions (2 out of 5 transactions).

The confidence of a rule is defined as $\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$. For example, the rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$ has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that the rule is correct for 50% of the transactions containing milk and bread. Confidence can be interpreted as an estimate of the probability $P(Y \mid X)$, the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.

ALGORITHM:
Association rule mining finds the association rules that satisfy a predefined minimum support and confidence in a given database. The problem is usually decomposed into two subproblems. The first is to find those item sets whose occurrences exceed a predefined threshold in the database; those item sets are called frequent or large item sets. The second is to generate association rules from those large item sets under the constraint of minimal confidence.

Suppose one of the large item sets is $L_k = \{I_1, I_2, \ldots, I_k\}$. Association rules with this item set are generated in the following way: the first rule is $\{I_1, I_2, \ldots, I_{k-1}\} \Rightarrow \{I_k\}$; by checking its confidence, this rule can be determined to be interesting or not. Further rules are generated by deleting the last item in the antecedent and inserting it into the consequent, and the confidences of the new rules are checked to determine their interestingness. This process is iterated until the antecedent becomes empty. Since the second subproblem is quite straightforward, most research focuses on the first subproblem.

The Apriori algorithm finds the frequent sets L in database D:
· Find the frequent set $L_{k-1}$.
· Join step: $C_k$ is generated by joining $L_{k-1}$ with itself.
· Prune step: any (k − 1)-item set that is not frequent cannot be a subset of a frequent k-item set, and hence should be removed.
Here $C_k$ is the candidate item set of size k and $L_k$ is the frequent item set of size k.

Apriori pseudo code:

    Apriori(T, ε)
        L(1) ← {large 1-itemsets that appear in at least ε transactions}
        k ← 2
        while L(k−1) ≠ ∅
            C(k) ← Generate(L(k−1))
            for transactions t ∈ T
                C(t) ← Subset(C(k), t)
                for candidates c ∈ C(t)
                    count[c] ← count[c] + 1
            L(k) ← {c ∈ C(k) | count[c] ≥ ε}
            k ← k + 1
        return ⋃ L(k) over all k

Procedure:
1) Given the Bank database for mining.
2) Select EXPLORER in the WEKA GUI Chooser.
3) Load "Bank.csv" in Weka using Open file in the Preprocess tab.
4) Select only nominal values.
5) Go to the Associate tab.
6) Select the Apriori algorithm using the Choose button in the Associator panel:
   weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
7) Click the Start button.
8) The sample rules are now shown.

Sample Output: [figure]

EXPERIMENT-3

One type of model that you can create is a Decision Tree. Train a Decision Tree using the complete dataset as the training data. Report the model obtained after training.

Aim: To create a decision tree by training on the data set using the Weka mining tool.

Tools/ Apparatus: Weka mining tool.

Theory:
Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.

A classification task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time.
In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case.

Classifications are discrete and do not imply order. Continuous, floating-point values would indicate a numerical, rather than a categorical, target. A predictive model with a numerical target uses a regression algorithm, not a classification algorithm.

The simplest type of classification problem is binary classification. In binary classification, the target attribute has only two possible values: for example, high credit rating or low credit rating. Multiclass targets have more than two values: for example, low, medium, high, or unknown credit rating.

In the model build (training) process, a classification algorithm finds relationships between the values of the predictors and the values of the target. Different classification algorithms use different techniques for finding relationships. These relationships are summarized in a model, which can then be applied to a different data set in which the class assignments are unknown.

Classification models are tested by comparing the predicted values to known target values in a set of test data. The historical data for a classification project is typically divided into two data sets: one for building the model, the other for testing the model.

Scoring a classification model results in class assignments and probabilities for each case. For example, a model that classifies customers as low, medium, or high value would also predict the probability of each classification for each customer.
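The build/test division and the scoring step described above can be illustrated in plain Python. The sketch below is only an assumption-laden toy: the ten cases, the single predictor, and the helper names (split_cases, build_model, score) are all hypothetical, and the "model" is just a table of class frequencies per predictor value rather than any algorithm used by Weka.

```python
import random
from collections import Counter

# Hypothetical labelled cases: (predictor value, known class).
cases = [("owns_home", "low")] * 6 + [("rents", "high")] * 4


def split_cases(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and return (build_set, test_set)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def build_model(rows):
    """For each predictor value, count the classes seen in the build set."""
    counts = {}
    for value, label in rows:
        counts.setdefault(value, Counter())[label] += 1
    return counts


def score(model, value):
    """Return (predicted class, probability of each class) for one case.

    Assumes the predictor value was seen in the build set.
    """
    dist = model[value]
    total = sum(dist.values())
    probs = {label: n / total for label, n in dist.items()}
    return max(probs, key=probs.get), probs


build_set, test_set = split_cases(cases)
model = build_model(build_set)

# Test the model: compare predictions with the known classes.
correct = sum(1 for value, label in test_set if score(model, value)[0] == label)
accuracy = correct / len(test_set)
print(len(build_set), len(test_set), accuracy)
```

Because the toy data is perfectly separable, every test case is classified correctly; real credit data would not behave this way.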
Classification has many applications in customer segmentation, business modeling, marketing, credit analysis, and biomedical and drug response modeling.

Different Classification Algorithms
Oracle Data Mining provides the following algorithms for classification:
• Decision Tree: Decision trees automatically generate rules, which are conditional statements that reveal the logic used to build the tree.
• Naive Bayes: Naive Bayes uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data.

Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER under Applications.
3) Select the Preprocess tab.
4) Click Open file and browse to the file already stored on the system, "bank.csv".
5) Go to the Classify tab.
6) The C4.5 algorithm is used here; its Java implementation in Weka is called J48. Click the Choose button.
7) Select trees > J48.
8) Under Test options, select "Use training set".
9) If necessary, select the class attribute.
10) Click Start.
11) The output details are now shown in the Classifier output panel.
12) Right-click on the entry in the result list and select the "Visualize tree" option.

Sample output: [figure]

EXPERIMENT-4

Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?

Aim: To find the percentage of examples that are classified correctly by the decision tree model created above, i.e., testing on the training set.

Tools/ Apparatus: Weka mining tool.
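The percentage asked for in the aim is the share of instances that lie on the diagonal of the confusion matrix. As a minimal sketch, the Python below recomputes it from the four counts that appear in the confusion matrix of this experiment's sample output; the kappa calculation uses the usual Cohen's kappa formula, and the variable names are illustrative.

```python
# Counts copied from the confusion matrix in this experiment's sample output.
# Rows are actual classes, columns are predicted classes, in the order YES, NO.
matrix = [[245, 29],
          [17, 309]]

total = sum(sum(row) for row in matrix)
correct = matrix[0][0] + matrix[1][1]
accuracy = 100.0 * correct / total

# Per-class figures for class YES.
tp_rate_yes = matrix[0][0] / sum(matrix[0])                    # recall
precision_yes = matrix[0][0] / (matrix[0][0] + matrix[1][0])

# Cohen's kappa: agreement beyond what the row and column totals
# would produce by chance.
observed = correct / total
expected = sum(
    sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(2)
) / total ** 2
kappa = (observed - expected) / (1 - expected)

print(f"{accuracy:.4f} %")                                      # 92.3333 %
print(round(tp_rate_yes, 3), round(precision_yes, 3), round(kappa, 3))
```

The recomputed values agree with the "Correctly Classified Instances", "TP Rate", "Precision", and "Kappa statistic" entries reported by Weka.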
Theory:
A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or on the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.

An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined, and not the entire covariance matrix.

The naive Bayes probabilistic model, for a class C and features $F_1, \ldots, F_n$:
$p(C \mid F_1, \ldots, F_n) \propto p(C) \prod_{i=1}^{n} p(F_i \mid C)$

DESCRIPTION:
• Classification: Predicts categorical class labels (discrete or nominal). Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data.
• Prediction: Models continuous-valued functions, i.e., predicts unknown or missing values.

Typical applications:
• Credit approval
• Target marketing
• Medical diagnosis
• Fraud detection

Training Data: Data objects whose class labels are known.

Accuracy: The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier.

Procedure:
1) Given the Bank database for mining.
2) Open the Weka GUI Chooser.
3) Select EXPLORER under Applications.
4) Select the Preprocess tab.
5) Click Open file and browse to the file already stored on the system, "bank.csv".
6) Go to the Classify tab.
7) Click Choose and open the "trees" group of classifiers.
8) Select "NBTree", i.e., the Naive Bayesian tree.
9) Under Test options, select "Use training set".
10) If necessary, select the class attribute.
11) Click Start.
12) The output details are now shown in the Classifier output panel.

Sample Output:

    === Evaluation on training set ===
    === Summary ===

    Correctly Classified Instances         554               92.3333 %
    Incorrectly Classified Instances        46                7.6667 %
    Kappa statistic                          0.845
    Mean absolute error                      0.1389
    Root mean squared error                  0.2636
    Relative absolute error                 27.9979 %
    Root relative squared error             52.9137 %
    Total Number of Instances              600

    === Detailed Accuracy By Class ===

                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                     0.894     0.052       0.935    0.894       0.914      0.936   YES
                     0.948     0.106       0.914    0.948       0.931      0.936   NO
    Weighted Avg.    0.923     0.081       0.924    0.923       0.923      0.936

    === Confusion Matrix ===

       a   b   <-- classified as
     245  29 |   a = YES
      17 309 |   b = NO

EXPERIMENT-5

Aim: To create a decision tree using cross-validation on the training data set with the Weka mining tool.

Tools/ Apparatus: Weka mining tool.

Theory:
Decision tree learning, used in data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather, the resulting classification tree can be an input for decision making. This section deals with decision trees in data mining.
Decision tree learning is a common method used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf.

A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when all records in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions.

In data mining, trees can also be described as the combination of mathematical and computational techniques to aid the description, categorization, and generalization of a given set of data. Data comes in records of the form:
$(\mathbf{x}, y) = (x_1, x_2, x_3, \ldots, x_k, y)$
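Recursive partitioning over records of the form (x, y) can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the C4.5/J48 algorithm: it chooses the split attribute by counting misclassified records instead of using an information-based measure, it does no pruning, and the four records and all function names are hypothetical.

```python
from collections import Counter


def majority(labels):
    """Most common target value in a list of labels."""
    return Counter(labels).most_common(1)[0][0]


def split_errors(records, attr):
    """Records misclassified if each branch of attr predicts its majority."""
    groups = {}
    for x, y in records:
        groups.setdefault(x[attr], []).append(y)
    return sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())


def build_tree(records, attributes):
    """Return a leaf (target value) or (attribute, {value: subtree}, default)."""
    labels = [y for _, y in records]
    # Stop when the subset is pure or there is nothing left to split on.
    if len(set(labels)) == 1 or not attributes:
        return majority(labels)
    best = min(attributes, key=lambda a: split_errors(records, a))
    remaining = [a for a in attributes if a != best]
    partitions = {}
    for x, y in records:
        partitions.setdefault(x[best], []).append((x, y))
    children = {v: build_tree(sub, remaining) for v, sub in partitions.items()}
    return (best, children, majority(labels))


def classify(tree, x):
    """Follow the path from the root to a leaf for one record x."""
    while isinstance(tree, tuple):
        attr, children, default = tree
        tree = children.get(x[attr], default)  # unseen value: use default
    return tree


# Hypothetical records (x, y): x is a dict of inputs, y is the target.
records = [
    ({"housing": "own", "job": "skilled"}, "good"),
    ({"housing": "own", "job": "unskilled"}, "good"),
    ({"housing": "rent", "job": "skilled"}, "good"),
    ({"housing": "rent", "job": "unskilled"}, "bad"),
]
tree = build_tree(records, ["housing", "job"])
print(tree)
print([classify(tree, x) for x, _ in records])
```

Each interior node is a tuple naming the attribute tested, and each leaf is a plain target value, which mirrors the description of interior nodes and leaves given above.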
