(67577) Introduction to Machine Learning    October 19, 2009
Lecture 1 – Introduction
Lecturer: Shai Shalev-Shwartz    Scribe: Shai Shalev-Shwartz
Based on a book by Shai Ben-David and Shai Shalev-Shwartz (in preparation)

1 What is learning?

The subject of this course is automated learning, or, as we will more often call it, machine learning (ML for short). Roughly speaking, we wish to program computers so that they can "learn". Before we discuss how machines can learn, or how the process of learning can be automated, let us consider two examples of naturally occurring animal learning. Not surprisingly, some of the most fundamental issues in ML already arise in that context, which we are all familiar with.

1. Bait shyness – rats learning to avoid poisonous baits. It is well known that, when eating a substance is followed by illness, an animal may associate the illness with the eaten substance and avoid that food in the future. This is called conditioned taste aversion or "bait shyness". For example, when rats encounter novel food items, they will first eat very small amounts, and subsequent feeding will depend on the flavor of the food and its physiological effect. If the food produces an ill effect, it will often be associated with the illness, and subsequently the rats will not eat it. Clearly, there is a learning mechanism at play here: if past experience with some food was negatively labeled, the animal predicts that it will also have a negative label when encountered in the future. Naturally, bait shyness is often a useful survival mechanism.

2. Pigeon superstition. Another demonstration of naive animal "learning" is the classic "pigeon superstition" experiment. In that experiment, the psychologist B. F. Skinner placed a group of hungry pigeons in a cage attached to an automatic mechanism that delivered food to the pigeons "at regular intervals with no reference whatsoever to the bird's behavior." What happens in such an experiment is that the hungry pigeons go around the cage pecking at random objects. When food is delivered, it finds each pigeon pecking at some object. Consequently, each bird tends to spend more pecking time at that lucky object. That, in turn, increases the chance that the next random food delivery will find each bird at that same location. What results is a chain of events that reinforces the pigeons' association of the delivery of the food with whatever chance actions they had been performing when it was first delivered. They subsequently continue to perform these same actions diligently (see http://psychclassics.yorku.ca/Skinner/Pigeon/).

What distinguishes learning mechanisms that result in superstition from useful learning? This question is crucial to the development of automated learners. While human learners can rely on common sense to filter out random, meaningless learning conclusions, once we export the task of learning to a program we must provide well-defined, crisp principles that will protect the program from reaching senseless or useless conclusions. The development of such principles is a central goal of the theory of machine learning. As a first step in this direction, let us take a closer look at the bait shyness phenomenon in rats.

3. Bait shyness revisited: rats fail to acquire conditioning between food and electric shock or between sound and nausea. The bait shyness mechanism in rats turns out to be more complex than one might expect.
In experiments carried out by Garcia, it was demonstrated that if the unpleasant stimulus that follows food consumption is replaced by, say, electric shock (rather than nausea), then no conditioning occurs. That is, even after repeated trials in which the consumption of some food is followed by the administration of an unpleasant electric shock, the rats do not tend to avoid that food. A similar failure of conditioning occurs when the characteristic of the food that implies nausea (its taste or smell) is replaced by a vocal signal. Clearly, the rats have some "built-in" prior knowledge telling them that, while a temporal correlation between food and nausea can be causal, it is unlikely that there is a causal relationship between food consumption and electric shocks.

We can now observe that one distinguishing feature between the bait shyness learning and the pigeon superstition is the incorporation of prior knowledge that biases the learning mechanism. The pigeons in the experiment are willing to adopt any explanation for the occurrence of food. However, the rats "know" that food cannot cause an electric shock and that the co-occurrence of noise with some food is not likely to affect the nutritional value of that food. The rats' learning process is biased toward detecting some kinds of patterns while ignoring other temporal correlations between events. It turns out that the incorporation of prior knowledge, biasing the learning process, is inevitable for the success of learning algorithms (this is formally stated and proved as the "No Free Lunch theorem"). The development of tools for expressing domain expertise, translating it into a learning bias, and quantifying the effect of such a bias on the success of learning is a central theme of the theory of machine learning. Roughly speaking, the stronger the prior knowledge (or prior assumptions) that one starts the learning process with, the easier it is to learn from further examples. However, the stronger these prior assumptions are, the less flexible the learning is: it is bound, a priori, by the commitment to these assumptions. We shall discuss these issues explicitly in the next lecture.

2 Why automate the process of learning?

Computers are being applied to almost every aspect of our lives. However, with all their innumerable successful applications, there are several inherent limitations to what human-written programs can do. One limiting factor is the need for explicit, detailed specification: in order to write a program to carry out some function, the programmer needs to fully understand, down to the level of the smallest details, how that function should be executed. Computers only follow the program's instructions and have no way of handling ambiguity or of adapting to scenarios that are not explicitly addressed by those instructions. In contrast, taking our cue from intelligent beings, many of our skills are acquired or refined through learning from experience (rather than by following explicit instructions given to us). Machine learning is concerned with endowing programs with the ability to "learn" and adapt. Many tasks currently carried out automatically utilize machine learning components.

Consider, for example, the task of driving. It would be extremely useful to have reliable automated drivers.
They would never doze off, never get drunk or angry at the other driver, and they could be sent out to dangerous roads without risking human lives, and so on. Why don't we see automated drivers around us on the roads (at least not as of 2010, when these handouts are being written)? Driving does not seem to require much intelligence, and computers already routinely perform far more sophisticated tasks. The obstacle lies exactly in the gap between what we can provide exact instructions for and what we must rely on experience to shape and refine. While we all drive, we acquire much of our driving skill through practice. We do not understand our driving decision-making well enough to be able to translate our driving skills into a computer program. Recent significant progress in the development of automated drivers relies on programs that can improve with experience – machine learning programs. Automated driving, medical research, natural language processing (including speech recognition and machine translation), and credit card fraud detection are only a few of the many areas where machine learning plays a central role in applying computers to substitute for human "thinking" and decision making.

Two aspects of a problem may call for the use of programs that learn and improve based on their "experience": the problem's complexity and its "adaptivity".

Tasks that are too complex to program.

• Tasks performed by animals/humans: there are numerous tasks that, although we perform them routinely, our introspection concerning how we do them is not sufficiently elaborate to extract a well-defined program. Examples of such tasks include driving, speech recognition, and face recognition. In all of these tasks, state-of-the-art ML programs – programs that "learn from their experience" – achieve quite satisfactory results once exposed to sufficiently many training examples.

• Tasks beyond human capabilities: another wide family of tasks that benefit from machine learning techniques is the analysis of very large and complex data sets: astronomical data, turning medical archives into medical knowledge, weather prediction, analysis of genomic data, web search engines, and electronic commerce. With more and more data available in electronic form, it becomes obvious that there are treasures of meaningful information buried in data archives that are way too large and too complex for humans to make sense of. Learning to detect meaningful patterns in large and complex data sets is a promising domain in which the combination of programs that learn with the almost unlimited memory capacity and processing speed of computers opens up new horizons.

Cope with diversity (adaptivity). One limiting feature of programmed tools is their rigidity: once the program has been written down and installed, it stays unchanged. However, many tasks change over time, or from one user to another, in a way that requires the way we handle them to adapt. Machine learning tools – programs whose behavior adapts to their input data – offer a solution to such issues; they are, by nature, adaptive to changes in the environment they interact with.
Typical successful applications of machine learning to such problems include programs that decode handwritten text, where a fixed program can adapt to variations between the handwriting of different users; spam detection programs, which adapt automatically to changes in the nature of spam emails; and speech recognition programs (again, a scenario in which a fixed program is required to handle large variability in the type of inputs it is applied to).

2.1 Types of learning

Learning is, of course, a very wide domain. Consequently, the field of machine learning has branched into several subfields dealing with different types of learning tasks. We give a rough taxonomy of learning paradigms, aiming to provide some perspective of where the content of this course sits within the wide field of machine learning.

The first classification we make regards the goal of the learning process. In the discriminative learning setting, the learner uses past experience in order to predict properties of future examples. That is, learning can be defined as the process of using experience to become an expert. The goal of the learning process should be well defined in advance, before seeing any examples. Discriminative learning is convenient because there is a clear measure of success: how good the predictions of the learner on future examples are. In non-discriminative learning, the goal of the learner is not well defined in advance. The learner aims to "understand" the data, e.g., by learning a generative probabilistic model that fits it. In this case, learning can be described as the process of finding meaningful simplicity in the midst of disorderly complexity. In this course we mostly deal with discriminative learning.

Next, we describe four parameters along which learning paradigms can be further classified.

Supervised vs. Unsupervised Since learning involves an interaction between the learner and the environment, one can divide learning tasks according to the nature of that interaction. The first distinction to note is the difference between supervised and unsupervised learning. As an illustrative example, consider the task of learning to detect spam email versus the task of anomaly detection. For the spam detection task, we consider a setting in which the learner receives training emails for which the label spam/not-spam is provided. Based on such training, the learner should figure out a rule for labeling a newly arriving email message. In contrast, for the task of anomaly detection, all the learner gets as training is a large body of email messages, and the learner's task is to detect "unusual" messages.

More abstractly, viewing learning as a process of "using experience to gain expertise", supervised learning describes a scenario in which the "experience", a training example, contains significant information that is missing in the "test examples" to which the learned expertise is to be applied (say, the spam/not-spam labels). In this setting, the acquired expertise is aimed at predicting that missing information for the test data. In such cases, we can think of the environment as a teacher that "supervises" the learner by providing the extra information (labels). In contrast, in unsupervised learning there is no distinction between training and test data. The learner processes input data with the goal of coming up with some summary, or compressed version, of that data. Clustering a data set into subsets of similar objects is a typical example of such a task.
There is also an intermediate learning setting in which, while the training examples contain more information than the test examples, the learner is required to predict even more information for the test examples. For example, one may try to learn a value function that describes, for each setting of a chess board, the degree by which White's position is better than Black's. Such value functions can be learned based on a database that contains positions that occurred in actual chess games, labeled by who eventually won that game. Such learning frameworks are mainly investigated under the title of 'reinforcement learning'.

Active vs. Passive learners Learning paradigms can vary by the role played by the learner. We distinguish between 'active' and 'passive' learners. An active learner interacts with the environment at training time, say by posing queries or performing experiments, while a passive learner only observes the information provided by the environment (or the teacher) without influencing or directing it. Note that the learner of a spam filter is usually passive, waiting for users to mark the emails arriving to them. In an active setting, one could imagine asking users to label specific emails chosen by the learner, or even composed by the learner, to enhance its understanding of what spam is.

Helpfulness of the teacher When one thinks about human learning, of a baby at home or a student at school, the process often involves a helpful teacher, who tries to feed the learner the information most useful for achieving the learning goal. In contrast, when a scientist learns about nature, the environment, playing the role of the teacher, can best be thought of as passive: apples drop, stars shine, and the rain falls without regard to the needs of the learner. We model such learning scenarios by postulating that the training data (or the learner's experience) is generated by some random process. This is the basic building block of the branch of 'statistical learning'. Finally, learning also occurs when the learner's input is generated by an adversarial "teacher". This may be the case in the spam filtering example (if the spammer makes an effort to mislead the spam filter's designer) or in learning to detect fraud. One also uses an adversarial teacher model as a worst-case scenario, when no milder setup can be safely assumed. If you can learn against an adversarial teacher, you are guaranteed to succeed no matter which teacher you end up interacting with.

Online vs. Batch learning protocol The last parameter we mention is the distinction between situations in which the learner has to respond online, throughout the learning process, and settings in which the learner only has to engage the acquired expertise after having had a chance to process large amounts of data. For example, a stockbroker has to make daily decisions based on the experience collected so far. He may become an expert over time, but might have made costly mistakes in the process. In contrast, in many data mining settings the learner – the data miner – has large amounts of training data to play with before having to output conclusions.

In this course we shall discuss only a subset of the possible learning paradigms. Our main focus is on supervised statistical batch learning with a passive learner (for example, trying to learn how to generate patients' prognoses based on large archives of patient records that were independently collected and are already labeled by the fate of the recorded patients).
We shall also briefly discuss online learning and batch unsupervised learning (in particular, clustering). Perhaps the most significant omission here, at least from the point of view of practical machine learning, is that this course does not address reinforcement learning.

(67577) Introduction to Machine Learning    October 20, 2009
Lecture 2 – A gentle start
Lecturer: Shai Shalev-Shwartz    Scribe: Shai Shalev-Shwartz
Based on a book by Shai Ben-David and Shai Shalev-Shwartz (in preparation)

In this lecture we give a first formal treatment of learning. We focus on a learning model called the PAC model. We aim at rigorously showing how data can be used for learning, as well as how overfitting might happen if we are not careful enough.

3 A statistical learning framework

Imagine you have just arrived on some small Pacific island. You soon find out that papayas are a significant ingredient in the local diet. However, you have never before tasted papayas. You have to learn how to predict whether a papaya you see in the market is tasty or not. There are two obvious features that you can base your prediction on: the papaya's color, ranging from dark green through orange and red to dark brown, and the papaya's softness, ranging from rock hard to mushy. Your input for figuring out your prediction rule is a sample of papayas that you have examined for color and softness and then tasted, finding out whether they were tasty or not. This is a typical learning problem.

3.1 A Formal Model

The Learner's Input: In the basic statistical learning setting, the learner has access to the following:

Domain Set: An arbitrary set, $X$. This is the set of objects that we may wish to label. For example, these could be papayas that we wish to classify as tasty or not-tasty, or email messages that we wish to classify as spam or not-spam. Usually, these domain points will be represented by a vector of features (like the papaya's color, softness, etc.).

Label Set: For our discussion, we will restrict $Y$ to be a two-element set, usually $\{0,1\}$ or $\{-1,+1\}$. In our papayas example, we let $+1$ represent being tasty and $-1$ being not-tasty.

Training data: $S = ((x_1,y_1),\ldots,(x_m,y_m))$ is a finite sequence of pairs in $X \times Y$, that is, a sequence of labeled domain points. This is the input that the learner has access to (like a set of papayas that have been tasted and their tastiness recorded).

The Learner's Output: The learner outputs a hypothesis or a prediction rule, $h : X \to Y$. This function can then be used to predict the label of new domain points. In our papayas example, it is the rule that our learner will employ to decide whether a future papaya she examines in the farmers market is going to be tasty or not.

A Measure of Success: We assume that the data we are interested in (the papayas we encounter) is generated by some probability distribution (say, the environment). We shall define the quality of a given hypothesis as the chance that it predicts the correct label for a data point generated by that underlying distribution. Equivalently, the error of a hypothesis quantifies how likely it is to make an error when labeled points are randomly drawn according to that data-generating distribution.
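To make these objects concrete, here is a minimal Python sketch of the learner's input and output for the papayas problem. The numeric feature encoding, the particular training points, and the example rule h are illustrative assumptions, not part of the formal model.

    # A minimal sketch of the learner's input (training data S) and
    # output (a prediction rule h); all numeric choices are assumptions.
    from typing import Callable, List, Tuple

    Instance = Tuple[float, float]   # (color, softness), each scaled to [0, 1]
    Label = int                      # +1 = tasty, -1 = not-tasty
    Hypothesis = Callable[[Instance], Label]

    # Training data S: a finite sequence of labeled domain points.
    S: List[Tuple[Instance, Label]] = [
        ((0.7, 0.6), +1),
        ((0.1, 0.9), -1),
        ((0.6, 0.5), +1),
    ]

    def h(x: Instance) -> Label:
        """A possible prediction rule (a hypothetical choice): tasty iff
        color and softness both fall in a medium range."""
        color, softness = x
        return +1 if 0.4 <= color <= 0.8 and 0.3 <= softness <= 0.7 else -1

    print([h(x) == y for x, y in S])   # how the rule fares on the sample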
Formally, we model the environment as a probability distribution, $D$, over $X \times Y$. Intuitively, $D(x,y)$ determines how likely it is to observe the pair $(x,y)$. We define the error of a classifier to be

$$\mathrm{err}_D(h) \;\stackrel{\mathrm{def}}{=}\; \Pr_{(x,y)\sim D}[h(x) \neq y] \;\stackrel{\mathrm{def}}{=}\; D(\{(x,y) : h(x) \neq y\}). \qquad (1)$$

That is, the error of a classifier $h$ is the probability of randomly choosing an example $(x,y)$ for which $h(x) \neq y$. The subscript $D$ reminds us that the error is measured with respect to the probability distribution $D$ over our "world", $X \times Y$. We omit this subscript when it is clear from the context. $\mathrm{err}_D(h)$ has several synonymous names, such as the generalization error, the test error, or the true error of $h$.

Note that this modeling of the world allows sampling the same instance with different labels. In the papayas example, this amounts to allowing two papayas with the same color and softness such that one of them is tasty and the other is not. In some situations, the labels are determined deterministically once the instance is fixed. This scenario can be derived from the general model as a special case by imposing the additional requirement that the conditional probability (according to $D$) of seeing a label $y$ given an instance $x$ is either 1 or 0.

i.i.d. assumption: The learner is blind to the underlying distribution $D$ over the world. In our papayas example, we have just arrived on a new island and we have no clue as to how papayas are distributed. The only way the learner can interact with the environment is through observing the training set. Of course, to facilitate meaningful learning, there must be some connection between the examples in the training set and the underlying distribution. Formally, we assume that each example in the training set is independently and identically distributed (i.i.d.) according to the distribution $D$. Intuitively, the training set $S$ is a window through which we look at the distribution $D$ over the world.

4 Empirical Risk Minimization

Recall that a learning algorithm receives as input a training set $S$, sampled i.i.d. from an unknown distribution $D$, and should output a predictor $h_S : X \to Y$, where the subscript $S$ emphasizes the fact that the output predictor depends on $S$. The error of $h_S$ is the probability that $h_S$ errs on a random example sampled according to $D$. Since we are blind to the distribution over the world, a natural and intuitive idea is to choose one of the predictors that make a minimal number of mistakes on the training set $S$. This learning paradigm is called Empirical Risk Minimization, or ERM for short. Formally, we denote the average number of mistakes a predictor makes on a training set $S$ by

$$\mathrm{err}_S(h) = \frac{|\{i : h(x_i) \neq y_i\}|}{m}.$$

This quantity is also called the empirical risk of $h$ or the training error of $h$. The ERM rule finds a predictor $h$ that minimizes $\mathrm{err}_S(h)$. Note that there may be several predictors that minimize the training error, and an ERM algorithm is free to choose any such predictor.
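A minimal sketch of the ERM rule in Python follows. Any program must search over some concrete collection of candidate predictors; the square-shaped predictors below and the tiny training set (the same S as in the previous sketch) are toy assumptions for illustration.

    # ERM over a small, assumed finite set of candidate predictors.
    def err_S(h, S):
        """Empirical risk: the fraction of training examples h mislabels."""
        return sum(1 for x, y in S if h(x) != y) / len(S)

    def erm(H, S):
        """Return a hypothesis in H with minimal training error
        (ties broken arbitrarily, here by min's first occurrence)."""
        return min(H, key=lambda h: err_S(h, S))

    def make_rect(lo, hi):
        # Predict +1 inside the square [lo, hi] x [lo, hi], -1 outside.
        return lambda x: +1 if lo <= x[0] <= hi and lo <= x[1] <= hi else -1

    H = [make_rect(lo, hi) for lo in (0.0, 0.2, 0.4) for hi in (0.6, 0.8, 1.0)]
    S = [((0.7, 0.6), +1), ((0.1, 0.9), -1), ((0.6, 0.5), +1)]  # as before
    h_S = erm(H, S)
    print(err_S(h_S, S))   # 0.0 on this toy sample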
4.1 Something goes wrong

Although the ERM rule seems like a natural learning algorithm, as we show next, without being careful this approach can fail miserably. Recall again the problem of learning the taste of a papaya based on its color and softness. Consider an i.i.d. sample of $m$ examples as depicted in Figure 1. Assume that the probability distribution $D$ is such that instances are distributed uniformly within the gray square, and the labels are determined deterministically to be 1 if the instance is within the blue square and 0 otherwise. The area of the gray square in the picture is 2 and the area of the blue square is 1. Consider the following predictor:

$$h_S(x) = \begin{cases} y_i & \text{if } \exists\, i \text{ s.t. } x_i = x, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

Clearly, no matter what the sample is, $\mathrm{err}_S(h_S) = 0$, and therefore this classifier may be chosen by an ERM algorithm, since it is one of the empirical-minimum-cost hypotheses. On the other hand, it is clear that the generalization error of any classifier that predicts the label 1 on only a finite number of instances is $1/2$. Thus, $\mathrm{err}_D(h_S) = 1/2$. This is a clear example of overfitting: we found a predictor whose performance on the training set is excellent but whose performance on the true world is very bad. Intuitively, overfitting occurs when we can explain every set of examples. The explanations of someone who can explain everything are suspicious.

Figure 1: An illustration of a sample for the papaya taste learning problem.
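The failure is easy to reproduce numerically. The following simulation sketch uses the rectangle [0,2] x [0,1], with the unit square [0,1] x [0,1] inside it, as an assumed stand-in for the distribution of Figure 1; the memorizing predictor of Eq. (2) attains zero training error while its true error is about 1/2.

    import random

    def sample(m):
        """Draw m i.i.d. examples: instances uniform on a gray region of
        area 2, labeled 1 on an inner unit square and 0 elsewhere
        (an assumed stand-in for Figure 1)."""
        xs = [(random.uniform(0, 2), random.uniform(0, 1)) for _ in range(m)]
        return [(x, 1 if x[0] <= 1 else 0) for x in xs]

    S = sample(100)
    memory = dict(S)              # x -> y for every training example

    def h_S(x):
        """The memorizing predictor of Eq. (2): recall the training label
        if x appeared in S, otherwise predict 0."""
        return memory.get(x, 0)

    train_err = sum(h_S(x) != y for x, y in S) / len(S)        # 0.0
    test = sample(100_000)
    test_err = sum(h_S(x) != y for x, y in test) / len(test)   # approx. 0.5
    print(train_err, test_err)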
5 Empirical Risk Minimization with inductive bias

In the example shown previously, the ERM rule led to overfitting: we found a predictor that has excellent performance on the training set but very bad performance on the underlying distribution. How can we avoid overfitting? A possible solution is to apply the ERM learning rule but restrict the search space. Formally, the learner should choose in advance (before seeing the data) a set of predictors. This set is called a hypothesis class and is denoted by $H$. That is, each $h \in H$ is a function mapping from $X$ to $Y$. After deciding on $H$, the learner samples a training set $S$ and uses the ERM rule to choose a predictor out of the hypothesis class: the learner chooses a hypothesis $h \in H$ that minimizes the error over the training set. By restricting the learner to choosing a predictor from $H$, we bias it toward a particular set of predictors. This preference is often called an inductive bias. Since $H$ is chosen in advance, we refer to it as prior knowledge on the problem.

A fundamental question in learning theory, which we will study later in the course, is which hypothesis classes guarantee learning without overfitting; that is, how the restriction of the search space to the predictors in $H$ saves us from overfitting. Intuitively, choosing a more restricted hypothesis class better protects us against overfitting, but at the same time might cause a larger inductive bias. We will get back to this fundamental tradeoff later. Next, we show how a restriction of the search space to a finite hypothesis class prevents overfitting.

5.1 A finite hypothesis class

Let $H$ be a finite hypothesis class. For example, $H$ can be the set of all predictors that can be implemented by a C++ program whose length is at most $k$ bits. Or, in our papayas example, $H$ can be the set of all rectangles whose coordinates are taken from a grid. The learning algorithm is allowed to use the training set to decide which predictor to choose from the hypothesis class $H$. In particular, we will analyze the performance of the "biased" ERM learning rule:

$$h_S \in \operatorname*{argmin}_{h \in H} \mathrm{err}_S(h), \qquad (3)$$

where ties are broken in some arbitrary way. To simplify the analysis of the biased ERM rule, we make one additional assumption (that will be relaxed later in this course).

Realizable assumption: There exists $h^\star \in H$ such that $\mathrm{err}_D(h^\star) = 0$. This assumption implies that for any training set $S$ we have $\mathrm{err}_S(h^\star) = 0$ with probability 1.

From the realizable assumption and the definition of the ERM rule given in Eq. (3), we have that $\mathrm{err}_S(h_S) = 0$ (with probability 1). But what we are interested in is the generalization error of $h_S$, that is, $\mathrm{err}_D(h_S)$. Since $\mathrm{err}_D(h_S)$ depends on the training set, it is a random variable, and we will therefore analyze the probability of sampling a training set for which $\mathrm{err}_D(h_S)$ is not too large. Formally, let $\epsilon$ be an accuracy parameter, where we interpret the event $\mathrm{err}_D(h_S) > \epsilon$ as severe overfitting, while if $\mathrm{err}_D(h_S) \le \epsilon$ we accept the output of the algorithm as an approximately correct predictor. Therefore, we are interested in bounding the probability

$$\Pr_{S \sim D^m}[\mathrm{err}_D(h_S) > \epsilon].$$

Let $H_B$ be the set of "bad" hypotheses, that is, $H_B = \{h \in H : \mathrm{err}_D(h) > \epsilon\}$. As mentioned previously, the realizable assumption implies that $\mathrm{err}_S(h_S) = 0$ with probability 1. This also implies that the event $\mathrm{err}_D(h_S) > \epsilon$ can only happen if for some $h \in H_B$ we have $\mathrm{err}_S(h) = 0$. Therefore, the set $\{S : \mathrm{err}_D(h_S) > \epsilon\}$ is contained in the set $\{S : \exists h \in H_B,\ \mathrm{err}_S(h) = 0\}$, and thus

$$\Pr_{S \sim D^m}[\mathrm{err}_D(h_S) > \epsilon] \;\le\; \Pr_{S \sim D^m}[\exists h \in H_B : \mathrm{err}_S(h) = 0]. \qquad (4)$$

Next, we upper bound the right-hand side of the above using the union bound, whose proof is trivial.

Lemma 1 (Union bound) For any two sets $A, B$ we have $\Pr[A \cup B] \le \Pr[A] + \Pr[B]$.

Since the set $\{S : \exists h \in H_B,\ \mathrm{err}_S(h) = 0\}$ can be written as $\bigcup_{h \in H_B} \{S : \mathrm{err}_S(h) = 0\}$, we can apply the union bound to the right-hand side of Eq. (4) to get

$$\Pr_{S \sim D^m}[\mathrm{err}_D(h_S) > \epsilon] \;\le\; \sum_{h \in H_B} \Pr_{S \sim D^m}[\mathrm{err}_S(h) = 0]. \qquad (5)$$

Next, let us bound each summand on the right-hand side. Fix some bad hypothesis $h \in H_B$. For each individual element of the training set we have

$$\Pr_{(x_i,y_i) \sim D}[h(x_i) = y_i] = 1 - \mathrm{err}_D(h) \le 1 - \epsilon.$$

Since the examples in the training set are sampled i.i.d., we get that for all $h \in H_B$,

$$\Pr_{S \sim D^m}[\mathrm{err}_S(h) = 0] = \Pr_{S \sim D^m}[\forall i,\ h(x_i) = y_i] = \prod_{i=1}^{m} \Pr_{(x_i,y_i) \sim D}[h(x_i) = y_i] \le (1-\epsilon)^m. \qquad (6)$$

Combining the above with Eq. (5) and using the inequality $1 - \epsilon \le e^{-\epsilon}$, we conclude that

$$\Pr_{S \sim D^m}[\mathrm{err}_D(h_S) > \epsilon] \;\le\; |H_B|\,(1-\epsilon)^m \;\le\; |H|\, e^{-\epsilon m}.$$

Corollary 1 Let $H$ be a finite hypothesis class. Let $\delta \in (0,1)$ and $\epsilon > 0$, and let $m$ be an integer that satisfies

$$m \;\ge\; \frac{\log(|H|/\delta)}{\epsilon}.$$

Then, for any distribution $D$ for which the realizable assumption holds, with probability of at least $1-\delta$ over the choice of an i.i.d. sample $S$ of size $m$, we have $\mathrm{err}_D(h_S) \le \epsilon$.

A graphical illustration which explains how we used the union bound is given in Figure 2.

Figure 2: Each point in the large circle represents a possible training set of $m$ examples. Each colored area represents the 'bad' training sets for some bad predictor $h \in H_B$, that is, $\{S : \mathrm{err}_S(h) = 0\}$ for a predictor $h$ with $\mathrm{err}_D(h) > \epsilon$. The ERM can potentially overfit whenever it gets a training set $S$ which is bad for some $h \in H_B$. Eq. (6) guarantees that for each individual $h \in H_B$, at most a $(1-\epsilon)^m$-fraction of the training sets will be bad. Using the union bound we bound the fraction of training sets which are bad for some $h \in H_B$. This bounds the fraction of training sets on which the ERM rule might overfit.

6 PAC learning

In the previous section we showed that if we restrict the search space to a finite hypothesis class, then the ERM rule will probably find a classifier whose error is approximately the error of the best classifier in the hypothesis class. This is called Probably Approximately Correct (PAC) learning.

Definition 1 (PAC learnability) A hypothesis class $H$ is PAC learnable if for any $\epsilon > 0$, $\delta \in (0,1)$ there exist $m = \mathrm{poly}(1/\epsilon, 1/\delta)$ and a learning algorithm such that for any distribution $D$ over $X \times Y$ which satisfies the realizability assumption, when running the learning algorithm on $m$ i.i.d. examples it returns $h \in H$ such that, with probability of at least $1-\delta$, $\mathrm{err}_D(h) \le \epsilon$.
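Definition 1's sample-size requirement is easy to instantiate for finite classes via Corollary 1. A small numeric sketch (the particular numbers are assumptions): for the class of all predictors implemented by programs of at most k bits, |H| <= 2^k, so the sufficient sample size grows linearly in k and 1/epsilon but only logarithmically in 1/delta.

    from math import ceil, log

    def pac_sample_size(H_size, epsilon, delta):
        """Sample size sufficient by Corollary 1:
        m >= log(|H|/delta) / epsilon (natural logarithm)."""
        return ceil(log(H_size / delta) / epsilon)

    # E.g., programs of at most 100 bits, accuracy 0.01, confidence 0.999:
    print(pac_sample_size(2**100, epsilon=0.01, delta=0.001))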
A few remarks:

• The definition of Probably Approximately Correct learnability contains two approximation parameters. The parameter $\epsilon$ is called the accuracy parameter (corresponding to "approximately correct") and the parameter $\delta$ is called the confidence parameter (corresponding to "probably"). The definition requires that the number of examples needed for achieving accuracy $\epsilon$ with confidence $1-\delta$ is polynomial in $1/\epsilon$ and in $1/\delta$. This is similar to the definition of fully polynomial-time approximation schemes (FPTAS) in the theory of computation.

• We require the learning algorithm to succeed for any distribution $D$ as long as the realizability assumption holds. Therefore, in this case the prior knowledge we have on the world is encoded in the choice of the hypothesis space and in the realizability assumption with respect to this hypothesis space. Later, we will describe the agnostic PAC model, in which we relax the realizability assumption as well. We will also describe learning frameworks in which the guarantees do not hold for any distribution but instead hold only for a specific parametric family of distributions, which implies a much stronger prior belief on the world.

• In the previous section we showed that a finite hypothesis class is PAC learnable. But there are infinite classes that are learnable as well. Later on we will show that what determines the PAC learnability of a class is not its finiteness but rather a combinatorial measure called the VC dimension.

(67577) Introduction to Machine Learning    October 26, 2009
Lecture 3 – The Bias-Complexity Tradeoff
Lecturer: Shai Shalev-Shwartz    Scribe: Shai Shalev-Shwartz
Based on a book by Shai Ben-David and Shai Shalev-Shwartz (in preparation)

In the previous lecture we showed that a finite hypothesis class is learnable in the PAC model. The PAC model assumes that there exists a perfect hypothesis (in terms of generalization error) in the hypothesis class. But what if this assumption does not hold? In this lecture we present the agnostic PAC model, in which we do not assume the existence of a perfect hypothesis. We analyze the learnability of a finite hypothesis class in the agnostic PAC model, and by doing so demonstrate the bias-complexity tradeoff, a fundamental concept in machine learning which captures the tension between overfitting and underfitting.

7 Agnostic PAC learning

In the previous lecture we defined the PAC learning model. We now define the agnostic PAC learning model, in which the realizability assumption is not required. Clearly, we cannot hope that the learning algorithm will find a hypothesis whose error is smaller than the minimal possible error, $\min_{h \in H} \mathrm{err}_D(h)$. Instead, we require that the learning algorithm will find a predictor whose error is not much larger than the best possible error of a predictor in the hypothesis class.

Definition 2 (agnostic PAC learnability) A hypothesis class $H$ is agnostic PAC learnable if for any $\epsilon > 0$, $\delta \in (0,1)$ there exist $m = \mathrm{poly}(1/\epsilon, 1/\delta)$ and a learning algorithm such that for any distribution $D$ over $X \times Y$, when running the learning algorithm on $m$ i.i.d. training examples it returns $h \in H$ such that, with probability of at least $1-\delta$ (over the choice of the $m$ training examples),

$$\mathrm{err}_D(h) \;\le\; \min_{h' \in H} \mathrm{err}_D(h') + \epsilon.$$

In this lecture we will prove that a finite hypothesis class is agnostic PAC learnable. To do so, we will show that there exists $m = \mathrm{poly}(1/\epsilon, 1/\delta)$ such that, with probability of at least $1-\delta$, for all $h \in H$ we have $|\mathrm{err}_D(h) - \mathrm{err}_S(h)| \le \epsilon/2$. This type of property is called uniform convergence.
The following simple lemma tells us that whenever uniform convergence holds, the ERM learning rule is guaranteed to return a good hypothesis.

Lemma 2 Let $D$ be a distribution, $S$ be a training set, and $H$ be a hypothesis class such that uniform convergence holds, namely

$$\forall h \in H, \quad |\mathrm{err}_D(h) - \mathrm{err}_S(h)| \le \tfrac{\epsilon}{2}.$$

Let $h_S \in \operatorname{argmin}_{h \in H} \mathrm{err}_S(h)$ be an ERM hypothesis. Then

$$\mathrm{err}_D(h_S) \;\le\; \min_{h \in H} \mathrm{err}_D(h) + \epsilon.$$

Proof Since $h_S$ is an ERM hypothesis, we have that for all $h \in H$,

$$\mathrm{err}_D(h_S) \;\le\; \mathrm{err}_S(h_S) + \tfrac{\epsilon}{2} \;\le\; \mathrm{err}_S(h) + \tfrac{\epsilon}{2} \;\le\; \mathrm{err}_D(h) + \tfrac{\epsilon}{2} + \tfrac{\epsilon}{2} \;=\; \mathrm{err}_D(h) + \epsilon.$$

The first and third inequalities are because of the uniform convergence, and the second inequality is because $h_S$ is an ERM predictor. The lemma follows because the inequality holds for all $h \in H$.

The above lemma tells us that uniform convergence is a sufficient condition for agnostic PAC learnability. Therefore, to prove that a finite hypothesis class is agnostic PAC learnable, it suffices to establish that uniform convergence holds for a finite hypothesis class. To show that uniform convergence holds, we follow a two-step argument. First, we will argue that $|\mathrm{err}_D(h) - \mathrm{err}_S(h)|$ is likely to be small for any fixed hypothesis which is chosen in advance, prior to the sampling of the training set. Since $\mathrm{err}_D(h)$ is the expected value of $\mathrm{err}_S(h)$, the quantity $|\mathrm{err}_D(h) - \mathrm{err}_S(h)|$ is the deviation of $\mathrm{err}_S(h)$ from its expectation. When this deviation is likely to be small, we say that the measure of $\mathrm{err}_S(h)$ is concentrated. Second, we will apply the union bound (similarly to the derivation in the previous lecture) to show that, with a slightly larger training set, the random variable $\mathrm{err}_S(h)$ is concentrated around its mean uniformly for all hypotheses in $H$.

8 Measure Concentration

Let $Z_1,\ldots,Z_m$ be an i.i.d. sequence of random variables and let $\mu$ be their mean. In the context of this chapter, one can think of $Z_i$ as being the random variable $|h(x_i) - y_i|$. The law of large numbers states that when $m$ tends to infinity, the empirical average $\frac{1}{m}\sum_{i=1}^m Z_i$ converges to the expected value $\mu$ with probability 1. Measure concentration inequalities quantify the deviation of the empirical average from the expectation when $m$ is finite.

We start with an inequality called Markov's inequality. Let $Z$ be a non-negative random variable. The expectation of $Z$ can be written as follows (see Exercise 2):

$$\mathbb{E}[Z] = \int_{0}^{\infty} \Pr[Z \ge x]\, dx. \qquad (7)$$

Since $\Pr[Z \ge x]$ is monotonically non-increasing, we obtain

$$\forall a \ge 0, \quad \mathbb{E}[Z] \;\ge\; \int_{0}^{a} \Pr[Z \ge x]\, dx \;\ge\; \int_{0}^{a} \Pr[Z \ge a]\, dx \;=\; a\,\Pr[Z \ge a]. \qquad (8)$$

Rearranging the above yields Markov's inequality:

$$\forall a \ge 0, \quad \Pr[Z \ge a] \;\le\; \frac{\mathbb{E}[Z]}{a}. \qquad (9)$$

Applying Markov's inequality to the random variable $(Z - \mathbb{E}[Z])^2$, we obtain Chebyshev's inequality:

$$\Pr[|Z - \mathbb{E}[Z]| \ge a] \;=\; \Pr[(Z - \mathbb{E}[Z])^2 \ge a^2] \;\le\; \frac{\mathrm{Var}[Z]}{a^2}, \qquad (10)$$

where $\mathrm{Var}[Z] = \mathbb{E}[(Z - \mathbb{E}[Z])^2]$ is the variance of $Z$.

Next, we apply Chebyshev's inequality to the random variable $\frac{1}{m}\sum_{i=1}^m Z_i$. Since $Z_1,\ldots,Z_m$ are i.i.d., we know that (see Exercise 3)

$$\mathrm{Var}\Big[\frac{1}{m}\sum_{i=1}^m Z_i\Big] = \frac{\mathrm{Var}[Z_1]}{m}.$$

From the above, we obtain the following:

Lemma 3 Let $Z_1,\ldots,Z_m$ be a sequence of i.i.d. random variables and assume that $\mathbb{E}[Z_1] = \mu$ and $\mathrm{Var}[Z_1] \le 1$. Then, for any $\delta \in (0,1)$, with probability of at least $1-\delta$ we have

$$\Big|\frac{1}{m}\sum_{i=1}^m Z_i - \mu\Big| \;\le\; \sqrt{\frac{1}{\delta m}}.$$

Proof Applying Chebyshev's inequality, we obtain that for all $a > 0$,

$$\Pr\Big[\Big|\frac{1}{m}\sum_{i=1}^m Z_i - \mu\Big| \ge a\Big] \;\le\; \frac{\mathrm{Var}[Z_1]}{m a^2} \;\le\; \frac{1}{m a^2}.$$

The proof follows by denoting the right-hand side $\delta$ and solving for $a$.
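A quick simulation sketch of Lemma 3; the Bernoulli(0.5) variables and all parameters are assumptions chosen for illustration. The fraction of runs whose empirical average deviates from the mean by more than sqrt(1/(delta*m)) should be at most delta; since Chebyshev is typically quite loose, far fewer violations occur in practice.

    import random
    from math import sqrt

    m, delta, trials = 1000, 0.1, 2000
    mu = 0.5                          # mean of a Bernoulli(0.5) variable
    bound = sqrt(1 / (delta * m))     # the deviation bound of Lemma 3

    violations = 0
    for _ in range(trials):
        avg = sum(random.random() < mu for _ in range(m)) / m
        if abs(avg - mu) > bound:
            violations += 1

    print(violations / trials)        # at most delta; in practice far smaller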
Let us apply the above lemma to the random variables $Z_i = |h(x_i) - y_i|$. Since $Z_i \in [0,1]$, we clearly have $\mathrm{Var}[Z_i] \le 1$. Choosing $\delta = 0.1$, we obtain that for 90% of the training sets we will have

$$|\mathrm{err}_S(h) - \mathrm{err}_D(h)| \;\le\; \sqrt{\frac{10}{m}}.$$

In other words, we can use $\mathrm{err}_S(h)$ to estimate $\mathrm{err}_D(h)$, and the estimate will be Probably Approximately Correct as long as $m$ is sufficiently large. We can also ask how many examples are needed in order to obtain accuracy $\epsilon$ with confidence $1-\delta$. Based on Lemma 3, we obtain that if

$$m \;\ge\; \frac{1}{\delta \epsilon^2}$$

then, with probability of at least $1-\delta$, the deviation between the empirical average and the mean is at most $\epsilon$.

The sample size given above depends polynomially on $1/\delta$, the inverse of the confidence parameter. It is possible to obtain a significantly milder dependence, and this will turn out to be crucial in later sections. We conclude this section with another measure concentration inequality, due to Hoeffding, in which the dependence on $1/\delta$ is only logarithmic.

Lemma 4 (Hoeffding's inequality) Let $Z_1,\ldots,Z_m$ be a sequence of i.i.d. random variables and assume that $\mathbb{E}[Z_1] = \mu$ and $\Pr[a \le Z_1 \le b] = 1$. Then, for any $\epsilon > 0$,

$$\Pr\Big[\Big|\frac{1}{m}\sum_{i=1}^m Z_i - \mu\Big| > \epsilon\Big] \;\le\; 2\exp\!\big(-2m\epsilon^2/(b-a)^2\big).$$

The proof is left as an exercise (Exercise 4). The advantage of Lemma 4 over Lemma 3 stems from the fact that in the former our confidence improves exponentially fast as the number of examples increases. Applying Hoeffding's inequality (Lemma 4) to the random variables $Z_i = |h(x_i) - y_i|$, we obtain:

Corollary 2 Let $h : X \to \{0,1\}$ be an arbitrary predictor and let $\epsilon > 0$ be an accuracy parameter. Then, for any distribution $D$ over $X \times \{0,1\}$, we have

$$\Pr_{S \sim D^m}\big[|\mathrm{err}_S(h) - \mathrm{err}_D(h)| > \epsilon\big] \;\le\; 2\exp(-2m\epsilon^2).$$

9 Finite classes are agnostic PAC learnable

We are now ready to prove that a finite hypothesis class is agnostic PAC learnable.

Theorem 1 Let $H$ be a finite hypothesis class. Let $\delta \in (0,1)$ and $\epsilon > 0$, and let $m$ be an integer that satisfies

$$m \;\ge\; \frac{2\log(2|H|/\delta)}{\epsilon^2}.$$

Then, for any distribution $D$, with probability of at least $1-\delta$ over the choice of an i.i.d. training set $S$ of size $m$, we have

$$\mathrm{err}_D(h_S) \;\le\; \min_{h \in H} \mathrm{err}_D(h) + \epsilon.$$

Proof From Lemma 2 we know that it suffices to prove that, with probability of at least $1-\delta$, for all $h \in H$ we have $|\mathrm{err}_D(h) - \mathrm{err}_S(h)| \le \epsilon/2$. In other words, it suffices to show that

$$\Pr_{S \sim D^m}\big[\exists h \in H,\ |\mathrm{err}_D(h) - \mathrm{err}_S(h)| > \tfrac{\epsilon}{2}\big] \;\le\; \delta.$$

Using the union bound, we have

$$\Pr_{S \sim D^m}\big[\exists h \in H,\ |\mathrm{err}_D(h) - \mathrm{err}_S(h)| > \tfrac{\epsilon}{2}\big] \;\le\; \sum_{h \in H} \Pr_{S \sim D^m}\big[|\mathrm{err}_D(h) - \mathrm{err}_S(h)| > \tfrac{\epsilon}{2}\big].$$

Using Corollary 2, we know that each summand on the right-hand side is at most $2\exp(-m\epsilon^2/2)$. Therefore,

$$\Pr_{S \sim D^m}\big[\exists h \in H,\ |\mathrm{err}_D(h) - \mathrm{err}_S(h)| > \tfrac{\epsilon}{2}\big] \;\le\; 2|H|\exp(-m\epsilon^2/2).$$

Combining the above with the requirement on $m$, we conclude our proof.

It is interesting to compare Theorem 1 to Corollary 1. Clearly, Theorem 1 is more general. On the other hand, whenever the realizable assumption holds, both Theorem 1 and Corollary 1 guarantee that $\mathrm{err}_D(h_S) \le \epsilon$, but the number of training examples required in Theorem 1 is larger: it grows like $1/\epsilon^2$, while in Corollary 1 it grows like $1/\epsilon$. We note in passing that it is possible to interpolate between the two results, but this is out of our scope.
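The gap between the two regimes is easy to see numerically; a small sketch, with illustrative numbers as assumptions:

    from math import ceil, log

    def m_realizable(H_size, eps, delta):
        """Corollary 1: m >= log(|H|/delta) / eps."""
        return ceil(log(H_size / delta) / eps)

    def m_agnostic(H_size, eps, delta):
        """Theorem 1: m >= 2*log(2|H|/delta) / eps**2."""
        return ceil(2 * log(2 * H_size / delta) / eps**2)

    for eps in (0.1, 0.01):
        print(eps, m_realizable(10**6, eps, 0.01), m_agnostic(10**6, eps, 0.01))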
10 Error decomposition

Learning is about replacing expert knowledge with data. In this lecture we have shown how data can be used to estimate the error of a hypothesis using a set of examples. In the previous lecture we also saw that, without being careful, the data can mislead us, and in particular we might suffer from overfitting. To overcome this problem, we restricted the search space to a particular hypothesis class $H$. How should we choose $H$? To answer this question, we decompose the error of the ERM predictor into:

• The approximation error – the minimum generalization error achievable by a predictor in the hypothesis class. The approximation error does not depend on the sample size and is determined by the hypothesis class allowed. A larger hypothesis class can decrease the approximation error.

• The estimation error – the difference between the approximation error and the error achieved by the ERM predictor. The estimation error is a result of the training error being only an estimate of the generalization error, and so the predictor minimizing the training error being only an estimate of the predictor minimizing the generalization error. The quality of this estimation depends on the training set size and on the size, or complexity, of the hypothesis class.

Formally, let $h_S$ be an ERM predictor. Then

$$\mathrm{err}_D(h_S) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}}, \quad\text{where}\quad \epsilon_{\mathrm{app}} = \min_{h \in H} \mathrm{err}_D(h), \qquad \epsilon_{\mathrm{est}} = \mathrm{err}_D(h_S) - \epsilon_{\mathrm{app}}. \qquad (11)$$

The first term, the approximation error, measures how much error we have because we restrict ourselves to the hypothesis class $H$. This is the inductive bias we add. It does not depend on the size of the training set. The second term, the estimation error, measures how much extra error we have because we learn a classifier based on a finite training set $S$ instead of knowing the distribution $D$. As we have shown, for a finite hypothesis class, $\epsilon_{\mathrm{est}}$ increases with $|H|$ and decreases with $m$. We can think of the size of $H$ as a measure of how complex $H$ is. Later in this course we will define other complexity measures of hypothesis classes.

Since our goal is to minimize the total generalization error, we face a tradeoff. On one hand, choosing $H$ to be a very rich class decreases the approximation error but at the same time increases the estimation error, as a rich $H$ might lead to overfitting. On the other hand, choosing $H$ to be a very small set reduces the estimation error but might increase the approximation error, or in other words might lead to underfitting. Of course, a great choice for $H$ is the class that contains only one classifier: the Bayes optimal classifier (the classifier whose generalization error is minimal; see the exercises for details). But the Bayes optimal classifier depends on the underlying distribution $D$, which we do not know; in fact, had we known $D$, learning would have been unnecessary. Learning theory studies how rich we can make $H$ while still maintaining reasonable estimation error. In many cases, empirical research focuses on designing good hypothesis classes for a certain domain. The idea is that although we are not experts and do not know how to construct the optimal classifier, we still have some prior knowledge on the specific problem at hand which enables us to design hypothesis classes for which both the approximation error and the estimation error are not too large. Getting back to our papayas example, we do not know exactly how the color and softness of a papaya predict its taste, but we do know that papaya is a fruit, and based on previous experience with other fruits we conjecture that a rectangle in the color-softness space may be a good predictor. We use the term bias-complexity tradeoff to denote the tradeoff between approximation error and estimation error.
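The estimation-error side of this tradeoff can be eyeballed from Theorem 1, which implies a bound of sqrt(2*log(2|H|/delta)/m) with probability at least 1-delta. A small sketch (class sizes and sample size are illustrative assumptions): richer classes shrink the approximation error but inflate this term.

    from math import log, sqrt

    def est_error_bound(H_size, m, delta=0.01):
        """Estimation-error bound implied by Theorem 1:
        sqrt(2*log(2|H|/delta)/m)."""
        return sqrt(2 * log(2 * H_size / delta) / m)

    m = 10_000
    for H_size in (10, 10**3, 10**9):
        print(H_size, round(est_error_bound(H_size, m), 4))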
11 Concluding Remarks

To summarize, in this lecture we introduced the agnostic PAC learning model and showed that finite hypothesis classes are learnable in this model. We made several assumptions, some of them arbitrary, and we now briefly mention them.

• Why model the environment as a distribution, and why is the interaction with the environment by sampling? This is a convenient model which is adequate for some situations. We will meet other learning models in which there is no distributional assumption and the interaction with the environment is different.

• Why binary classifiers? The short answer is simplicity. In binary classification the definition of error is very clear: we either predict the label correctly or not. Many of the results we will discuss in the course hold for other types of predictors as well.

• Why consider only ERM? The ERM learning rule is very natural. Nevertheless, in some situations we will discuss other learning rules. We will show that in the agnostic PAC model, if a class is learnable then it is learnable with the ERM rule.

• Why distribution-free learning? Our goal was to make as few assumptions as possible, and the agnostic PAC model indeed makes only a few. There are popular alternative models of learning in which we make rather strong distributional assumptions, for example generative models, which we discuss later in the course.

• Why restrict the hypothesis space? We will show that there are other ways to restrict the search space and avoid overfitting.

• Why finite hypothesis classes? We will show learnability results for infinite classes in the next lectures.

• We ignore computational issues, for simplicity. In some cases, computational issues make the ERM learning rule infeasible.

12 Exercise

1. The Bayes error: Among all classifiers, the Bayes optimal classifier is the one which has the lowest generalization error, and the Bayes error is its generalization error. The error of the Bayes optimal classifier is due to the intrinsic non-determinism of our world as reflected in $D$. In particular, if $y$ is deterministically set given $x$, then $\mathrm{err}_D(h_{\mathrm{Bayes}}) = 0$. The Bayes optimal classifier depends on the distribution $D$, which is unknown to us. Had we known how to construct the Bayes optimal classifier, we wouldn't need to learn anything. Show that the Bayes optimal classifier is

$$h_{\mathrm{Bayes}}(x) = \operatorname*{argmax}_{y \in Y} \Pr[y \mid x]. \qquad (12)$$

2. Prove Eq. (7).

3. Calculate the variance of sums of independent random variables.

4. Prove Hoeffding's inequality.

(67577) Introduction to Machine Learning    November 2, 2009
Lecture 4 – Minimum Description Length
Lecturer: Shai Shalev-Shwartz    Scribe: Shai Shalev-Shwartz
Based on a book by Shai Ben-David and Shai Shalev-Shwartz (in preparation)

In the previous lecture we saw that finite hypothesis classes are learnable in the agnostic PAC model. In the case of a finite set of hypotheses we established a uniform convergence result: the training error is close to the generalization error for all hypotheses in the finite class. In this lecture we will see that it is possible to learn infinite hypothesis classes even though uniform convergence does not hold. Instead of requiring the algorithm to choose a hypothesis from a finite hypothesis class, we will define weights over the hypotheses. The idea is to divide the uncertainty of the sampling error unevenly among the different hypotheses, such that the estimation error will be smaller for hypotheses with larger weights. The learner will need to balance between choosing hypotheses with larger weights (thus having a smaller estimation error) and choosing hypotheses that fit the training set well.
13 Generalization bounds for weighted hypothesis classes

Let $H$ be any hypothesis class, and let $w : H \to [0,1]$ be a function such that $\sum_{h \in H} w(h) \le 1$. One can think of $w$ as assigning 'weights' to the hypotheses in $H$. Of course, if $H$ is finite, then $w$ can treat all members of $H$ equally, setting $w(h) = 1/|H|$ for every $h \in H$. When $H$ is infinite, such a uniform weighting is not possible, but many other weighting schemes may work. For example, when $H$ is countable, one can pick any one-to-one function $f$ from the set of natural numbers, $\mathbb{N}$, to $H$ and then define, for any $h$ in the range of $f$, $w(h) = 1/\big(f^{-1}(h)\big)^2$ (scaled by $6/\pi^2$ so that the weights sum to at most 1). We will see how such a weighting of the members of $H$ can replace the assumption that $H$ is finite and allow some form of learnability for that class.

Theorem 2 Let $H$ be a hypothesis class, and let $w : H \to [0,1]$ be any function satisfying $\sum_{h \in H} w(h) \le 1$. Then, for every sample size $m$, every confidence parameter $\delta > 0$, and every distribution $D$ over $X \times \{0,1\}$,

$$\Pr_{S \sim D^m}\Bigg[\exists h \in H \ \text{s.t.}\ |\mathrm{err}_S(h) - \mathrm{err}_D(h)| \ge \sqrt{\frac{\ln(1/w(h)) + \ln(2/\delta)}{2m}}\Bigg] \;\le\; \delta.$$

Proof For all $h \in H$, define $\epsilon_h = \sqrt{\frac{\ln(1/w(h)) + \ln(2/\delta)}{2m}}$. Using the union bound, we have

$$\Pr_{S \sim D^m}\big[\exists h \in H \ \text{s.t.}\ |\mathrm{err}_S(h) - \mathrm{err}_D(h)| \ge \epsilon_h\big] \;\le\; \sum_{h \in H} \Pr_{S \sim D^m}\big[|\mathrm{err}_S(h) - \mathrm{err}_D(h)| \ge \epsilon_h\big].$$

We can bound each summand using Hoeffding's inequality (Corollary 2):

$$\Pr_{S \sim D^m}\big[|\mathrm{err}_S(h) - \mathrm{err}_D(h)| \ge \epsilon_h\big] \;\le\; 2\exp(-2m\epsilon_h^2).$$

Combining the above two inequalities and plugging in the definition of $\epsilon_h$, we obtain

$$\Pr_{S \sim D^m}\big[\exists h \in H \ \text{s.t.}\ |\mathrm{err}_S(h) - \mathrm{err}_D(h)| \ge \epsilon_h\big] \;\le\; \sum_{h \in H} \delta\, w(h).$$

The theorem now follows from the assumption that $\sum_{h \in H} w(h) \le 1$.

Note that the generalization bound we established for a finite hypothesis class (Theorem 1) can be viewed as a special case of the above theorem, obtained by choosing the uniform weighting $w(h) = 1/|H|$ for all $h \in H$.

13.1 Bound minimization

A direct corollary of Theorem 2 is the following:

Corollary 3 Under the conditions of Theorem 2, for every sample size $m$, every confidence parameter $\delta > 0$, and every distribution $D$ over $X \times \{0,1\}$, with probability greater than $1-\delta$ over a training set of size $m$ generated i.i.d. by $D$,

$$\forall h \in H, \quad \mathrm{err}_D(h) \;\le\; \mathrm{err}_S(h) + \sqrt{\frac{\ln(1/w(h)) + \ln(2/\delta)}{2m}}.$$

Such an error bound suggests a natural learning paradigm: given a hypothesis class $H$ and a weighting function $w$, upon receiving a training set $S$, search for an $h \in H$ that minimizes the bound on the true error, namely return a hypothesis in

$$\operatorname*{argmin}_{h \in H}\ \mathrm{err}_S(h) + \sqrt{\frac{\ln(1/w(h)) + \ln(2/\delta)}{2m}}.$$

That is, unlike the ERM paradigm discussed in previous lectures, we no longer care only about the empirical error, but are also willing to trade off some of that error (or, if you wish, some bias) for the sake of a better estimation-error term (or, complexity). The above results can be applied to several natural weighting functions. We shall discuss two such examples: description length and prior probability over hypotheses.
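A minimal sketch of this bound-minimization rule in Python follows; the two constant predictors, their weights, and the training set are toy assumptions.

    from math import log, sqrt

    def penalized_erm(H, w, S, delta=0.05):
        """Return the h in H minimizing
        err_S(h) + sqrt((ln(1/w[h]) + ln(2/delta)) / (2m))."""
        m = len(S)
        def bound(h):
            emp = sum(1 for x, y in S if h(x) != y) / m
            return emp + sqrt((log(1 / w[h]) + log(2 / delta)) / (2 * m))
        return min(H, key=bound)

    # Toy usage: two constant predictors with prior weights 3/4 and 1/4.
    h_pos = lambda x: 1
    h_neg = lambda x: 0
    H = [h_pos, h_neg]
    w = {h_pos: 0.75, h_neg: 0.25}
    S = [((0.2,), 1), ((0.9,), 1), ((0.5,), 0)]
    print(penalized_erm(H, w, S) is h_pos)

Note that the weighting enters only through the penalty term: a heavily weighted hypothesis can be selected even with a slightly worse training error.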
13.2 Description Length and Occam's Razor

Having a hypothesis class, one can wonder how we describe, or represent, each function in the class. We naturally fix some description language. This can be English, or a programming language, or some set of mathematical formulas. Any of these languages consists of finite strings of symbols (or characters) drawn from some fixed alphabet. It is worthwhile to formalize these notions.

Let $H$ be some set (possibly infinite), the members of which we wish to describe. Fix some finite set $\Sigma$ of symbols (or "characters"), which we call the alphabet. For concreteness, we let $\Sigma = \{0,1\}$. A string is a finite sequence of symbols from $\Sigma$; for example, $\sigma = (0,1,1,1,0)$ is a string of length 5. We denote by $|\sigma|$ the length of a string $\sigma$. The set of all finite-length strings is denoted $\Sigma^*$. A description language for $H$ is a function $d : H \to \Sigma^*$, mapping each member $h$ of $H$ to a string $d(h)$. We call $d(h)$ 'the description of $h$', and we denote by $|h|$ the length of the string $d(h)$.

We shall require that description languages be prefix-free. That is, for every distinct $h, h'$, neither of $d(h), d(h')$ is a prefix of the other: we do not allow the string $d(h)$ to be exactly the first $|h|$ symbols of the string $d(h')$, nor the other way around. Prefix-free collections of strings have the following nice combinatorial property:

Lemma 5 (Kraft inequality) If $S \subseteq \{0,1\}^*$ is a prefix-free set of strings, then

$$\sum_{\sigma \in S} \frac{1}{2^{|\sigma|}} \;\le\; 1.$$

Proof Define a probability distribution over the members of $S$ as follows: repeatedly toss an unbiased coin, with faces labeled 0 and 1, until the sequence of outcomes is a member of $S$; at that point, stop. For each $\sigma \in S$, let $P(\sigma)$ be the probability that this process generates the string $\sigma$. Note that since $S$ is prefix-free, for every $\sigma \in S$, if the coin toss outcomes follow the bits of $\sigma$ then we will stop only once the sequence of outcomes equals $\sigma$. We therefore get that, for every $\sigma \in S$, $P(\sigma) = \frac{1}{2^{|\sigma|}}$. Since probabilities add up to at most one, our proof is concluded.

In light of Kraft's inequality, any prefix-free description language of a hypothesis class $H$ gives rise to a weighting function $w$ over that hypothesis class: we simply set $w(h) = \frac{1}{2^{|h|}}$. This observation immediately yields the following:

Theorem 3 Let $H$ be a hypothesis class and let $d : H \to \{0,1\}^*$ be a prefix-free description language for $H$. Then, for every sample size $m$, every confidence parameter $\delta > 0$, and every probability distribution $D$ over $X \times \{0,1\}$, with probability greater than $1-\delta$ over an $m$-size training set generated i.i.d. by $D$,

$$\forall h \in H, \quad \mathrm{err}_D(h) \;\le\; \mathrm{err}_S(h) + \sqrt{\frac{|h| + \ln(2/\delta)}{2m}}.$$

Proof Just substitute $w(h) = 1/2^{|h|}$ in Corollary 3 above (and bound $\ln(1/w(h)) = |h| \ln 2$ by $|h|$).

As was the case with Corollary 3, this result suggests a learning paradigm for $H$: given a training set $S$, search for a hypothesis $h \in H$ that minimizes the error bound $\mathrm{err}_S(h) + \sqrt{\frac{|h| + \ln(2/\delta)}{2m}}$. In particular, it suggests trading off training error for short description length.

13.2.1 Occam's razor

Theorem 3 tells us that if we have two hypotheses with the same training error, then for the one with the shorter description our bound on the generalization error is lower. Thus, this result can be viewed as carrying a philosophical message: a short explanation (that is, a hypothesis that has a short description) tends to be more valid than a long explanation. This is a well-known principle, called Occam's razor, after William of Ockham, a 14th-century English logician who is believed to have been the first to phrase it explicitly. Here, we seem to provide a rigorous justification for this principle. The inequality of Theorem 3 shows that the more complex a hypothesis $h$ is (in the sense of having a long description), the larger the sample size it has to be fit to in order to guarantee that it really has a small generalization error, $\mathrm{err}_D(h)$. The second term in the inequality requires $m$, the sample size, to be at least of the order of $|h|$, the length of the description of $h$. Note that, as opposed to the context in which the Occam's razor principle is usually invoked in science, here we consider an arbitrary abstract description language, rather than a natural language.
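A numeric sketch of the Occam bound's penalty term (the description lengths and sample size are assumptions): at a fixed sample size, a hypothesis whose description is much longer than m gets a vacuous bound, matching the remark that m must be at least of the order of |h|.

    from math import log, sqrt

    def mdl_penalty(desc_len_bits, m, delta=0.05):
        """The bound's second term: sqrt((|h| + ln(2/delta)) / (2m))."""
        return sqrt((desc_len_bits + log(2 / delta)) / (2 * m))

    m = 10_000
    for bits in (10, 1_000, 100_000):
        print(bits, round(mdl_penalty(bits, m), 3))
    # 10-bit descriptions cost almost nothing; 100000-bit descriptions
    # give a penalty larger than 1, i.e. a vacuous bound at this sample size.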
13.2.2 The role of the description language

On second thought, there may be something bothersome about our Occam's razor result: the simplicity of a hypothesis is measured by the length of the string assigned to it by the description language, yet this language is quite arbitrary. Assume that we have two hypotheses such that $|h'|$ is much smaller than $|h|$. By the result above, if both have the same error on a given training set $S$, then the true error of $h$ may be much higher than the true error of $h'$, so one should prefer $h'$ over $h$. However, we could have chosen a different description language, say one that assigns a string of length 3 to $h$ and a string of length 100000 to $h'$. Suddenly it looks like one should prefer $h$ over $h'$. But these are the same $h$ and $h'$ for which we argued two sentences ago that $h'$ should be preferable. Where is the catch?

Indeed, there is no inherent generalizability difference between hypotheses. The crucial aspect here is the dependency order between the initial choice of language (or, preference over hypotheses) and the training set. As we know from the basic Hoeffding bound (Corollary 2), if we commit to any hypothesis before seeing the data, then we are guaranteed a rather small generalization-error term: $\mathrm{err}_D(h) \le \mathrm{err}_S(h) + \sqrt{\frac{\ln(2/\delta)}{2m}}$. Choosing a description language (or, equivalently, some weighting of hypotheses) is a weak form of committing to a hypothesis. Rather than committing to a single hypothesis, we spread out our commitment among many. As long as this is done independently of the training sample, our generalization bound holds. Just as the choice of a single hypothesis to be evaluated by a sample can be arbitrary, so is the choice of description language for the description-length-based bound.

13.3 Incorporating a prior probability over hypotheses

Yet another intuitive aspect of Theorem 2 is gained by viewing the weighting function $w$ as a probability distribution over the set of hypotheses $H$. This is possible since all weights are non-negative and their sum is at most 1. Under such an interpretation, what the learner starts with is some prior probability, evaluating the likelihood of each hypothesis to be the best predictor. We call it a "prior" since, as discussed above, that probability (or weighting) must be decided before the learner accesses the training sample. Under this interpretation, Corollary 3 reads as follows:

Theorem 4 Let $H$ be a hypothesis class and let $P$ be any probability distribution over $H$ (the "prior"). For every sample size $m$, every probability distribution $D$ over $X \times \{0,1\}$, and every confidence parameter $\delta > 0$, with probability greater than $1-\delta$ over an $m$-size training set generated i.i.d. by $D$, for every $h \in H$,

$$\mathrm{err}_D(h) \;\le\; \mathrm{err}_S(h) + \sqrt{\frac{\ln(1/P(h)) + \ln(2/\delta)}{2m}}.$$

In particular, this result suggests a learning paradigm that searches for a hypothesis $h \in H$ balancing a tradeoff between having low sample error and having high prior probability. Or, from a slightly different angle: the supporting evidence, in the form of a training sample, that one needs in order to validate an a priori unlikely hypothesis is larger than what is needed to validate a likely hypothesis. This is very much in line with our daily experience.
(67577) Introduction to Machine Learning    November 3, 2009
Lecture 5 – Decision Trees
Lecturer: Ohad Shamir    Scribe: Ohad Shamir
Based on a book by Shai Ben-David and Shai Shalev-Shwartz (in preparation)

A decision tree is a classifier, $h : X \to Y$, that predicts the label associated with an instance $x$ by traveling from a root node of a tree to a leaf. At each node on the root-to-leaf path, the successor child is chosen based on a splitting of the input space. Usually, the splitting is based on one of the attributes of $x$ or on a predefined set of splitting rules. A leaf contains a specific label. An example of a decision tree for the papayas example is given below:

    Color?
      other: not-tasty
      pale green to pale yellow: Softness?
        other: not-tasty
        gives slightly to palm pressure: tasty

To check if a given papaya is tasty or not, the decision tree above first examines the color of the papaya. If this color is not in the range pale green to pale yellow, the tree immediately predicts that the papaya is not tasty, without additional tests. Otherwise, the tree turns to examine the softness of the papaya. If the softness level of the papaya is such that it gives slightly to palm pressure, the decision tree predicts that the papaya is tasty. Otherwise, the prediction is "not-tasty". The above example underscores one of the main advantages of decision trees: the resulting classifier is simple to understand and easy to interpret.

We can think of a decision tree as a splitting of the instance space, $X$, into cells, where each leaf of the tree corresponds to one cell. Clearly, if we allow decision trees of arbitrary size, then for any training set $S$ we can in general find a decision tree that attains zero training error on $S$, simply by splitting the instance space into small enough regions such that each region contains a single training example.¹ Such an approach can easily lead to overfitting. To avoid overfitting, we can rely on the Occam's razor principle described in the previous section, and aim at learning a decision tree that, on one hand, fits the data well and, on the other hand, is not too large.

For simplicity, we will assume that $X = \{0,1\}^d$. In other words, each instance is a vector of $d$ bits. We will also assume that our decision tree is always a binary tree, with internal nodes of the form '$x_i = 0$?' for some $i \in \{1,\ldots,d\}$. For instance, we can model the papaya decision tree above by assuming that a papaya is parameterized by a two-bit vector $x = (x_1, x_2)$, where the bit $x_1$ represents whether the color is pale green to pale yellow or not, and the bit $x_2$ represents whether the softness is 'gives slightly to palm pressure' or not. With this representation, the node 'Color?' can be replaced with '$x_1 = 0$?', and the node 'Softness?' can be replaced with '$x_2 = 0$?'. While this is a big simplification, the algorithms and analysis we provide below can be extended to more general cases. With these simplifying assumptions, the hypothesis class becomes finite, but it is still very large. In particular, it is possible to show that any classifier from $\{0,1\}^d$ to $\{0,1\}$ can be represented by a decision tree.

¹ We might have two identical examples in the training set that have different labels, but if the instance space $X$ is large enough and the distribution is not focused on few elements of $X$, such an event will occur with very low probability.
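Under the two-bit representation just described, the papaya tree is only a few lines of code. This sketch follows the text's simplifying assumptions about the encoding of color and softness.

    def papaya_tree(x):
        """Travel from root to leaf: x = (x1, x2), where x1 = 1 iff the
        color is in the pale green to pale yellow range and x2 = 1 iff
        the papaya gives slightly to palm pressure."""
        x1, x2 = x
        if x1 == 0:      # node 'x1 = 0?': color outside the good range
            return "not-tasty"
        if x2 == 0:      # node 'x2 = 0?': wrong softness
            return "not-tasty"
        return "tasty"

    print(papaya_tree((1, 1)))   # tasty
    print(papaya_tree((1, 0)))   # not-tasty
    print(papaya_tree((0, 1)))   # not-tasty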