



Classification and Regression
- What is classification? What is regression?
- Issues regarding classification and regression
- Classification by decision tree induction
- Bayesian classification
- Other classification methods
- Regression

What is Bayesian Classification?
- Bayesian classifiers are statistical classifiers.
- For each new sample, they provide a probability that the sample belongs to each of the classes.

Bayes' Theorem: Basics
- Let X be a data sample ("evidence"); its class label is unknown.
- Let H be the hypothesis that X belongs to class C.
- Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X.
- P(H) is the prior probability, the initial probability of the hypothesis.
  - E.g., the probability that X will buy a computer, regardless of age, income, etc.
- P(X) is the probability that the sample data is observed.
- P(X|H) is the likelihood, the probability of observing the sample X given that the hypothesis holds.
  - E.g., given that X will buy a computer, the probability that X is 31..40 with medium income.

Bayes' Theorem
- Given the data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H|X) = P(X|H) P(H) / P(X)
- Informally, this can be written as: posterior = likelihood × prior / evidence.
- Predict that X belongs to class Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes.
- Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost.

Towards Naïve Bayesian Classifiers
- Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn).
- Suppose there are m classes C1, C2, …, Cm.
- Classification is to derive the maximum a posteriori class, i.e., the class Ci with maximal P(Ci|X).
- This can be derived from Bayes' theorem:
  P(Ci|X) = P(X|Ci) P(Ci) / P(X)
- Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
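To make the decision rule above concrete, here is a minimal sketch in Python. The function name and dictionary layout are my own illustrative choices, not from the slides; the numeric inputs are taken from the buys_computer worked example that appears below.

```python
# Minimal sketch of the maximum-a-posteriori decision rule:
# choose the class Ci that maximizes P(X|Ci) * P(Ci).

def map_classify(priors, likelihoods):
    """priors: {class: P(Ci)}, likelihoods: {class: P(X|Ci)} for one fixed sample X."""
    scores = {c: likelihoods[c] * priors[c] for c in priors}
    # P(X) is identical for every class, so it can be ignored when comparing.
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    priors = {"yes": 0.643, "no": 0.357}        # class priors from the example below
    likelihoods = {"yes": 0.044, "no": 0.019}   # P(X|Ci) from the example below
    label, scores = map_classify(priors, likelihoods)
    print(label, scores)                        # -> 'yes', since 0.028 > 0.007
```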
Derivation of Naïve Bayesian Classifier
- A simplifying assumption: the attributes are conditionally independent given the class (i.e., there is no dependence relation between attributes):
  P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
- This greatly reduces the computation cost: only the class distribution needs to be counted.
- If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D).
- If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π)·σ)) · exp(−(x − μ)² / (2σ²))
  and P(xk|Ci) = g(xk, μCi, σCi).

NBC: Training Dataset
Class C1: buys_computer = 'yes'; class C2: buys_computer = 'no'.
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair).

  age     income   student  credit_rating  buys_computer
  <=30    high     no       fair           no
  <=30    high     no       excellent      no
  31..40  high     no       fair           yes
  >40     medium   no       fair           yes
  >40     low      yes      fair           yes
  >40     low      yes      excellent      no
  31..40  low      yes      excellent      yes
  <=30    medium   no       fair           no
  <=30    low      yes      fair           yes
  >40     medium   yes      fair           yes
  <=30    medium   yes      excellent      yes
  31..40  medium   no       excellent      yes
  31..40  high     yes      fair           yes
  >40     medium   no       excellent      no

NBC: An Example
- P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no") = 5/14 = 0.357
- Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  P(X | buys_computer = "yes") · P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") · P(buys_computer = "no") = 0.007
- Since 0.028 > 0.007, the classifier predicts buys_computer = "yes" for X.
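The hand computation above can be reproduced with a short script. The following is a minimal sketch (the function and variable names are my own, not from the slides) that counts categorical frequencies from the training table and scores the sample X under the conditional-independence assumption.

```python
from collections import Counter, defaultdict

# Training tuples: (age, income, student, credit_rating, buys_computer).
data = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),       (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),  ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),      (">40", "medium", "no", "excellent", "no"),
]

class_counts = Counter(row[-1] for row in data)   # counts for P(Ci)
attr_counts = defaultdict(Counter)                # counts for P(xk|Ci)
for *features, label in data:
    for k, value in enumerate(features):
        attr_counts[label][(k, value)] += 1

def score(x, label):
    """Unnormalized posterior P(X|Ci) * P(Ci) under the independence assumption."""
    p = class_counts[label] / len(data)
    for k, value in enumerate(x):
        p *= attr_counts[label][(k, value)] / class_counts[label]
    return p

x = ("<=30", "medium", "yes", "fair")
scores = {c: score(x, c) for c in class_counts}
print(scores)                        # ≈ {'yes': 0.028, 'no': 0.007}, as in the slides
print(max(scores, key=scores.get))   # -> 'yes'
```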
Naive Bayesian Classifier Example: Play Tennis
Training set:

  Outlook   Temperature  Humidity  Windy  Class
  sunny     hot          high      false  N
  sunny     hot          high      true   N
  overcast  hot          high      false  P
  rain      mild         high      false  P
  rain      cool         normal    false  P
  rain      cool         normal    true   N
  overcast  cool         normal    true   P
  sunny     mild         high      false  N
  sunny     cool         normal    false  P
  rain      mild         normal    false  P
  sunny     mild         normal    true   P
  overcast  mild         high      true   P
  overcast  hot          normal    false  P
  rain      mild         high      true   N

Splitting the tuples by class gives 9 tuples of class P (play) and 5 tuples of class N (don't play).

Given the training set, we compute the conditional probabilities:

  Outlook   P    N        Humidity  P    N
  sunny     2/9  3/5      high      3/9  4/5
  overcast  4/9  0        normal    6/9  1/5
  rain      3/9  2/5

  Temperature  P    N     Windy  P    N
  hot          2/9  2/5   true   3/9  3/5
  mild         4/9  2/5   false  6/9  2/5
  cool         3/9  1/5

We also have the prior probabilities: P(P) = 9/14 and P(N) = 5/14.

To classify a new sample X = (outlook = sunny, temperature = cool, humidity = high, windy = false):
- Prob(P|X) ∝ Prob(P) · Prob(sunny|P) · Prob(cool|P) · Prob(high|P) · Prob(false|P)
  = 9/14 · 2/9 · 3/9 · 3/9 · 6/9 ≈ 0.01
- Prob(N|X) ∝ Prob(N) · Prob(sunny|N) · Prob(cool|N) · Prob(high|N) · Prob(false|N)
  = 5/14 · 3/5 · 1/5 · 4/5 · 2/5 ≈ 0.013
- Therefore X takes class label N.

Second example: X = (rain, hot, high, false):
- P(X|P) · P(P) = P(rain|P) · P(hot|P) · P(high|P) · P(false|P) · P(P) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
- P(X|N) · P(N) = P(rain|N) · P(hot|N) · P(high|N) · P(false|N) · P(N) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
- Sample X is classified in class N (don't play).

Avoiding the 0-Probability Problem
- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
  P(X|Ci) = ∏(k=1..n) P(xk|Ci)
- Example: suppose a dataset with 1000 tuples has income = low (0 tuples), income = medium (990 tuples), and income = high (10 tuples).
- Use the Laplacian correction (or Laplacian estimator): add 1 to each case:
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
- The "corrected" probability estimates are close to their "uncorrected" counterparts.
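A quick sketch of the Laplacian correction just described (illustrative Python; the function name is my own) reproduces the corrected estimates quoted above.

```python
from collections import Counter

# Laplacian (add-one) correction for a categorical attribute:
# 1000 tuples with income = low (0), medium (990), high (10).
counts = Counter({"low": 0, "medium": 990, "high": 10})

def laplace_estimates(counts):
    """Add 1 to each value's count so that no conditional probability is zero."""
    total = sum(counts.values()) + len(counts)        # 1000 + 3 = 1003
    return {value: (c + 1) / total for value, c in counts.items()}

print(laplace_estimates(counts))
# -> low = 1/1003 ≈ 0.001, medium = 991/1003 ≈ 0.988, high = 11/1003 ≈ 0.011
```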
NBC: Comments
- Advantages:
  - Easy to implement.
  - Good results obtained in most cases.
- Disadvantages:
  - The class-conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables.
  - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.).
  - Dependencies among these cannot be modeled by the Naïve Bayesian Classifier.
- How to deal with these dependencies? Bayesian Belief Networks.

Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent.
- It is a graphical model of causal relationships:
  - Represents dependencies among the variables.
  - Gives a specification of the joint probability distribution.
- Nodes: random variables. Links: dependencies.
  - Example: X and Y are the parents of Z, and Y is the parent of P.
  - There is no dependency between Z and P.
  - The graph has no loops or cycles.

Bayesian Belief Network: An Example
Variables: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea; FH and S are the parents of LC.
The conditional probability table (CPT) for the variable LungCancer:

        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
  LC    0.8      0.5       0.7       0.1
  ~LC   0.2      0.5       0.3       0.9

- The CPT shows the conditional probability of the variable for each possible combination of the values of its parents.
- Derivation of the probability of a particular combination of values x1, …, xn from the CPTs:
  P(x1, …, xn) = ∏(i=1..n) P(xi | Parents(Yi))

Using Bayesian Belief Networks
- P(v1, …, vn) = ∏ P(vi | Parents(vi))
- Example:
  P(LC = yes, FH = yes, S = yes)
  = P(FH = yes) · P(S = yes) · P(LC = yes | FH = yes, S = yes)
  = P(FH = yes) · P(S = yes) · 0.8
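The factored joint probability can be sketched as follows. The slides do not give priors for FamilyHistory and Smoker, so the values below are made-up assumptions; only the CPT entry 0.8 and the factorization itself come from the example above.

```python
# Sketch of P(v1,...,vn) = Π P(vi | Parents(vi)) for the FH / S / LC fragment
# of the example network. Priors for FH and S are assumed, not from the slides.

P_FH = {True: 0.3, False: 0.7}            # assumed P(FamilyHistory)
P_S = {True: 0.4, False: 0.6}             # assumed P(Smoker)
P_LC_given = {                            # CPT for LungCancer, indexed by (FH, S)
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    """P(FH=fh, S=s, LC=lc) = P(FH) * P(S) * P(LC | FH, S)."""
    p_lc = P_LC_given[(fh, s)]
    return P_FH[fh] * P_S[s] * (p_lc if lc else 1.0 - p_lc)

# Matches the slide example: P(LC=yes, FH=yes, S=yes) = P(FH=yes) * P(S=yes) * 0.8
print(joint(True, True, True))            # = 0.3 * 0.4 * 0.8 = 0.096 under the assumed priors
```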
Training Bayesian Networks
Several scenarios:
- Given both the network structure and all variables observable: learn only the CPTs.
- Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method.
- Network structure unknown, all variables observable: search through the model space to reconstruct the network topology.
- Unknown structure, all hidden variables: no good algorithms are known for this purpose.

Using IF-THEN Rules for Classification
- Represent the knowledge in the form of IF-THEN rules:
  R: IF age = youth AND student = yes THEN buys_computer = yes
  - Rule antecedent (precondition) vs. rule consequent.
- Assessment of a rule: coverage and accuracy.
  - n_covers = number of tuples covered by R.
  - n_correct = number of tuples correctly classified by R.
  - coverage(R) = n_covers / |D|, where D is the training data set.
  - accuracy(R) = n_correct / n_covers.
- If more than one rule is triggered, conflict resolution is needed:
  - Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirements (i.e., the most attribute tests).
  - Class-based ordering: decreasing order of prevalence or misclassification cost per class.
  - Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts.

Rule Extraction from a Decision Tree
- Rules are easier to understand than large trees.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction.
- Rules are mutually exclusive and exhaustive.
- Example: rule extraction from our buys_computer decision tree (root: age, with branches <=30, 31..40, >40; the <=30 branch tests student, the >40 branch tests credit_rating):
  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = yes
  IF age = old AND credit_rating = fair THEN buys_computer = no

Instance-Based Methods
- Instance-based learning: store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
- Typical approach: the k-nearest neighbor approach, where instances are represented as points in a Euclidean space.

The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-dimensional space.
- The nearest neighbors are defined in terms of Euclidean distance.
- The target function can be discrete-valued or real-valued.
- For a discrete-valued target function, k-NN returns the most common value among the k training examples nearest to the query point xq.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

Discussion on the k-NN Algorithm
- Distance-weighted nearest neighbor algorithm (see the sketch below):
  - Weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors:
    w = 1 / d(xq, xi)²
  - The same idea applies to real-valued target functions.
- Robust to noisy data by averaging over the k nearest neighbors.
- Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes.
  - To overcome it, stretch the axes or eliminate the least relevant attributes.
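To make the distance-weighted k-NN rule concrete, here is a small sketch (illustrative Python; the toy data points and function names are my own, not from the slides) that classifies a query point by weighting each of its k nearest neighbors by 1/d².

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, xq, k=3):
    """Distance-weighted k-NN for a discrete-valued target: weight each of the
    k nearest neighbors by w = 1 / d(xq, xi)^2 and return the class with the
    largest total weight."""
    neighbors = sorted(training, key=lambda pt: euclidean(pt[0], xq))[:k]
    votes = defaultdict(float)
    for xi, label in neighbors:
        d = euclidean(xi, xq)
        votes[label] += float("inf") if d == 0 else 1.0 / d ** 2
    return max(votes, key=votes.get)

# Toy 2-D training set (illustrative points only).
training = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"),
            ((2.9, 3.0), "-"), ((0.9, 1.3), "+")]
print(knn_classify(training, (1.1, 1.0), k=3))   # -> '+'
```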