Knowledge in Learning in Artificial Intelligence

19 KNOWLEDGE IN LEARNING

In which we examine the problem of learning when you know something already.

In all of the approaches to learning described in the previous three chapters, the idea is to construct a function that has the input/output behavior observed in the data. In each case, the learning methods can be understood as searching a hypothesis space to find a suitable function, starting from only a very basic assumption about the form of the function, such as "second-degree polynomial" or "decision tree," and a bias such as "simpler is better." Doing this amounts to saying that before you can learn something new, you must first forget (almost) everything you know. In this chapter, we study learning methods that can take advantage of prior knowledge about the world. In most cases, the prior knowledge is represented as general first-order logical theories; thus, for the first time, we bring together the work on knowledge representation and learning.

19.1 A Logical Formulation of Learning

Chapter 18 defined pure inductive learning as a process of finding a hypothesis that agrees with the observed examples. Here, we specialize this definition to the case where the hypothesis is represented by a set of logical sentences. Example descriptions and classifications will also be logical sentences, and a new example can be classified by inferring a classification sentence from the hypothesis and the example description. This approach allows for incremental construction of hypotheses, one sentence at a time. It also allows for prior knowledge, because sentences that are already known can assist in the classification of new examples. The logical formulation of learning may seem like a lot of extra work at first, but it turns out to clarify many of the issues in learning. It enables us to go well beyond the simple learning methods of Chapter 18 by using the full power of logical inference in the service of learning.

Examples and hypotheses

Recall from Chapter 18 the restaurant learning problem: learning a rule for deciding whether to wait for a table. Examples were described by attributes such as Alternate, Bar, Fri/Sat, and so on. In a logical setting, an example is an object that is described by a logical sentence; the attributes become unary predicates. Let us generically call the ith example Xi. For instance, the first example from Figure 18.3 is described by the sentences

    Alternate(X1) ∧ ¬Bar(X1) ∧ ¬Fri/Sat(X1) ∧ Hungry(X1) ∧ ...

We will use the notation Di(Xi) to refer to the description of Xi, where Di can be any logical expression taking a single argument. The classification of the object is given by the sentence WillWait(X1). We will use the generic notation Q(Xi) if the example is positive, and ¬Q(Xi) if the example is negative. The complete training set is then just the conjunction of all the description and classification sentences.

The aim of inductive learning in the logical setting is to find an equivalent logical expression for the goal predicate Q that we can use to classify examples correctly. Each hypothesis proposes such an expression, which we call a candidate definition of the goal predicate. Using Ci to denote the candidate definition, each hypothesis Hi is a sentence of the form ∀x Q(x) ⇔ Ci(x). For example, a decision tree asserts that the goal predicate is true of an object if and only if one of the branches leading to true is satisfied.
Thus, the decision tree in Figure 18.6 expresses the following logical definition (which we will call Hr for future reference):

    ∀r WillWait(r) ⇔ Patrons(r, Some)
                     ∨ (Patrons(r, Full) ∧ Hungry(r) ∧ Type(r, French))
                     ∨ (Patrons(r, Full) ∧ Hungry(r) ∧ Type(r, Thai) ∧ Fri/Sat(r))
                     ∨ (Patrons(r, Full) ∧ Hungry(r) ∧ Type(r, Burger)) .        (19.1)

Each hypothesis predicts that a certain set of examples (namely, those that satisfy its candidate definition) will be examples of the goal predicate. This set is called the extension of the predicate. Two hypotheses with different extensions are therefore logically inconsistent with each other, because they disagree on their predictions for at least one example. If they have the same extension, they are logically equivalent.

The hypothesis space H is the set of all hypotheses {H1, ..., Hn} that the learning algorithm is designed to entertain. For example, the DECISION-TREE-LEARNING algorithm can entertain any decision tree hypothesis defined in terms of the attributes provided; its hypothesis space therefore consists of all these decision trees. Presumably, the learning algorithm believes that one of the hypotheses is correct; that is, it believes the sentence

    H1 ∨ H2 ∨ H3 ∨ ... ∨ Hn .        (19.2)

As the examples arrive, hypotheses that are not consistent with the examples can be ruled out. Let us examine this notion of consistency more carefully. Obviously, if hypothesis Hi is consistent with the entire training set, it has to be consistent with each example. What would it mean for it to be inconsistent with an example? This can happen in one of two ways:

• An example can be a false negative for the hypothesis, if the hypothesis says it should be negative but in fact it is positive. For instance, the new example X13 described by

    Patrons(X13, Full) ∧ Wait(X13, 0-10) ∧ ¬Hungry(X13) ∧ ... ∧ WillWait(X13)

  would be a false negative for the hypothesis Hr given earlier. From Hr and the example description, we can deduce both WillWait(X13), which is what the example says, and ¬WillWait(X13), which is what the hypothesis predicts. The hypothesis and the example are therefore logically inconsistent.

• An example can be a false positive for the hypothesis, if the hypothesis says it should be positive but in fact it is negative.¹

If an example is a false positive or false negative for a hypothesis, then the example and the hypothesis are logically inconsistent with each other. Assuming that the example is a correct observation of fact, the hypothesis can be ruled out. Logically, this is exactly analogous to the resolution rule of inference (see Chapter 9), where the disjunction of hypotheses corresponds to a clause and the example corresponds to a literal that resolves against one of the literals in the clause. An ordinary logical inference system therefore could, in principle, learn from the example by eliminating one or more hypotheses. Suppose, for example, that the example is denoted by the sentence I1, and the hypothesis space is H1 ∨ H2 ∨ H3 ∨ H4. Then if I1 is inconsistent with H2 and H3, the logical inference system can deduce the new hypothesis space H1 ∨ H4. We therefore can characterize inductive learning in a logical setting as a process of gradually eliminating hypotheses that are inconsistent with the examples, narrowing down the possibilities.
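To make the elimination view concrete, here is a minimal Python sketch (ours, not the book's) that represents example descriptions as attribute dictionaries, hypotheses as Boolean candidate definitions over them, and rules out any hypothesis for which an example is a false positive or a false negative. The attribute values and the two sample hypotheses are illustrative assumptions.

    # Examples: an attribute description plus the true classification of the goal predicate.
    examples = [
        ({"Patrons": "Some", "Hungry": True,  "Type": "French"}, True),
        ({"Patrons": "None", "Hungry": False, "Type": "Thai"},   False),
        ({"Patrons": "Full", "Hungry": True,  "Type": "Burger"}, True),
    ]

    # Hypotheses: candidate definitions of the goal predicate, as predicates on descriptions.
    hypotheses = {
        "H1": lambda e: e["Patrons"] == "Some",
        "H2": lambda e: e["Patrons"] == "Some"
                        or (e["Patrons"] == "Full" and e["Hungry"]),
    }

    def consistent(h, example):
        """False positive: h predicts True on a negative example.
        False negative: h predicts False on a positive example."""
        description, classification = example
        return h(description) == classification

    # Inductive learning as elimination: keep only hypotheses consistent with every example.
    for example in examples:
        hypotheses = {name: h for name, h in hypotheses.items()
                      if consistent(h, example)}

    print(sorted(hypotheses))  # ['H2']: H1 is a false negative on the third example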
Because the hypothesis space is usually vast (or even infinite in the case of first-order logic), we do not recommend trying to build a learning system using resolution-based theorem proving and a complete enumeration of the hypothesis space. Instead, we will describe two approaches that find logically consistent hypotheses with much less effort.

Current-best-hypothesis search

The idea behind current-best-hypothesis search is to maintain a single hypothesis, and to adjust it as new examples arrive in order to maintain consistency. The basic algorithm was described by John Stuart Mill (1843), and may well have appeared even earlier. Suppose we have some hypothesis such as Hr, of which we have grown quite fond. As long as each new example is consistent, we need do nothing. Then along comes a false negative example, X13. What do we do?

Figure 19.1(a) shows Hr schematically as a region: everything inside the rectangle is part of the extension of Hr. The examples that have actually been seen so far are shown as "+" or "-", and we see that Hr correctly categorizes all the examples as positive or negative examples of WillWait. In Figure 19.1(b), a new example (circled) is a false negative: the hypothesis says it should be negative but it is actually positive. The extension of the hypothesis must be increased to include it. This is called generalization; one possible generalization is shown in Figure 19.1(c). Then in Figure 19.1(d), we see a false positive: the hypothesis says the new example (circled) should be positive, but it actually is negative. The extension of the hypothesis must be decreased to exclude the example. This is called specialization; in Figure 19.1(e) we see one possible specialization of the hypothesis.

¹ The terms "false positive" and "false negative" are used in medicine to describe erroneous results from lab tests. A result is a false positive if it indicates that the patient has the disease when in fact no disease is present.

Figure 19.1 (a) A consistent hypothesis. (b) A false negative. (c) The hypothesis is generalized. (d) A false positive. (e) The hypothesis is specialized.

The "more general than" and "more specific than" relations between hypotheses provide the logical structure on the hypothesis space that makes efficient search possible. We can now specify the CURRENT-BEST-LEARNING algorithm, shown in Figure 19.2. Notice that each time we consider generalizing or specializing the hypothesis, we must check for consistency with the other examples, because an arbitrary increase/decrease in the extension might include/exclude previously seen negative/positive examples.

function CURRENT-BEST-LEARNING(examples) returns a hypothesis
  H ← any hypothesis consistent with the first example in examples
  for each remaining example e in examples do
    if e is a false positive for H then
      H ← choose a specialization of H consistent with examples
    else if e is a false negative for H then
      H ← choose a generalization of H consistent with examples
    if no consistent specialization/generalization can be found then fail
  return H

Figure 19.2 The current-best-hypothesis learning algorithm. It searches for a consistent hypothesis and backtracks when no consistent specialization/generalization can be found.
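As a concrete (and simplified) rendering of Figure 19.2, the sketch below implements the control structure in Python, with the nondeterministic choices handled by recursion, so that failure at a later example backtracks to an earlier choice point. The two operator functions, specializations and generalizations, are assumed to be supplied for whatever hypothesis representation is in use; they are placeholders here, not part of the book's pseudocode.

    def consistent(h, examples):
        """A hypothesis is consistent if it classifies every example correctly."""
        return all(h(desc) == cls for desc, cls in examples)

    def current_best_learning(examples, h, specializations, generalizations, seen=()):
        """Return a hypothesis consistent with all examples, or None on failure.

        h is the current hypothesis; seen collects the examples processed so far.
        specializations(h, desc) / generalizations(h, desc) yield candidate revisions."""
        if not examples:
            return h
        (desc, cls), rest = examples[0], examples[1:]
        seen = seen + ((desc, cls),)
        if h(desc) == cls:                      # consistent with e: no change needed
            return current_best_learning(rest, h, specializations,
                                         generalizations, seen)
        # false positive -> specialize; false negative -> generalize
        candidates = specializations(h, desc) if h(desc) else generalizations(h, desc)
        for h2 in candidates:                   # try each choice; recursion backtracks
            if consistent(h2, seen):
                result = current_best_learning(rest, h2, specializations,
                                               generalizations, seen)
                if result is not None:
                    return result
        return None                             # no consistent revision: fail/backtrack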
We have defined generalization and specialization as operations that change the extension of a hypothesis. Now we need to determine exactly how they can be implemented as syntactic operations that change the candidate definition associated with the hypothesis, so that a program can carry them out. This is done by first noting that generalization and specialization are also logical relationships between hypotheses. If hypothesis H1, with definition C1, is a generalization of hypothesis H2 with definition C2, then we must have

    ∀x C2(x) ⇒ C1(x) .

Therefore, in order to construct a generalization of H2, we simply need to find a definition C1 that is logically implied by C2. This is easily done. For example, if C2(x) is Alternate(x) ∧ Patrons(x, Some), then one possible generalization is given by C1(x) ≡ Patrons(x, Some). This is called dropping conditions. Intuitively, it generates a weaker definition and therefore allows a larger set of positive examples. There are a number of other generalization operations, depending on the language being operated on. Similarly, we can specialize a hypothesis by adding extra conditions to its candidate definition or by removing disjuncts from a disjunctive definition. (A small code sketch of these two operators appears at the end of this subsection.)

Let us see how this works on the restaurant example, using the data in Figure 18.3. The first example X1 is positive. Alternate(X1) is true, so let the initial hypothesis be

    H1: ∀x WillWait(x) ⇔ Alternate(x) .

The second example X2 is negative. H1 predicts it to be positive, so it is a false positive. Therefore, we need to specialize H1. This can be done by adding an extra condition that will rule out X2. One possibility is

    H2: ∀x WillWait(x) ⇔ Alternate(x) ∧ Patrons(x, Some) .

The third example X3 is positive. H2 predicts it to be negative, so it is a false negative. Therefore, we need to generalize H2. We drop the Alternate condition, yielding

    H3: ∀x WillWait(x) ⇔ Patrons(x, Some) .

The fourth example X4 is positive. H3 predicts it to be negative, so it is a false negative. We therefore need to generalize H3. We cannot drop the Patrons condition, because that would yield an all-inclusive hypothesis that would be inconsistent with X2. One possibility is to add a disjunct:

    H4: ∀x WillWait(x) ⇔ Patrons(x, Some) ∨ (Patrons(x, Full) ∧ Fri/Sat(x)) .

Already, the hypothesis is starting to look reasonable. Obviously, there are other possibilities consistent with the first four examples; here are two of them:

    H4′: ∀x WillWait(x) ⇔ ¬WaitEstimate(x, 30-60) .
    H4″: ∀x WillWait(x) ⇔ Patrons(x, Some) ∨ (Patrons(x, Full) ∧ WaitEstimate(x, 10-30)) .

The CURRENT-BEST-LEARNING algorithm is described nondeterministically, because at any point, there may be several possible specializations or generalizations that can be applied. The choices that are made will not necessarily lead to the simplest hypothesis, and may lead to an unrecoverable situation where no simple modification of the hypothesis is consistent with all of the data. In such cases, the program must backtrack to a previous choice point.

The CURRENT-BEST-LEARNING algorithm and its variants have been used in many machine learning systems, starting with Patrick Winston's (1970) "arch-learning" program. With a large number of instances and a large space, however, some difficulties arise:

1. Checking all the previous instances over again for each modification is very expensive.
2. The search process may involve a great deal of backtracking. As we saw in Chapter 18, hypothesis space can be a doubly exponentially large place.
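Returning to the syntactic operators, here is the promised sketch (an illustration, not the book's code) for purely conjunctive candidate definitions, represented in Python as sets of attribute-value conditions. Generalization drops a condition; specialization adds one that rules out the offending negative example. The attribute values for X2 are invented for the demonstration.

    def matches(definition, desc):
        """A conjunctive definition is a set of (attribute, value) conditions."""
        return all(desc.get(a) == v for a, v in definition)

    def generalizations(definition):
        """Dropping conditions: every way of removing one conjunct."""
        for cond in definition:
            yield definition - {cond}

    def specializations(definition, negative_desc, conditions):
        """Adding conditions: add one conjunct that rules out the negative example."""
        for a, v in conditions:
            if (a, v) not in definition and negative_desc.get(a) != v:
                yield definition | {(a, v)}

    # The H1 -> H2 step from the text: X2 is a false positive for
    # "WillWait(x) <=> Alternate(x)", so we add a condition that rules X2 out.
    h1 = frozenset({("Alternate", True)})
    x2 = {"Alternate": True, "Patrons": "Full"}          # illustrative attribute values
    print(list(specializations(h1, x2, [("Patrons", "Some")])))
    # [frozenset({('Alternate', True), ('Patrons', 'Some')})], i.e., H2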
Least-commitment search

Backtracking arises because the current-best-hypothesis approach has to choose a particular hypothesis as its best guess even though it does not have enough data yet to be sure of the choice. What we can do instead is to keep around all and only those hypotheses that are consistent with all the data so far. Each new instance will either have no effect or will get rid of some of the hypotheses. Recall that the original hypothesis space can be viewed as a disjunctive sentence

    H1 ∨ H2 ∨ H3 ∨ ... ∨ Hn .

As various hypotheses are found to be inconsistent with the examples, this disjunction shrinks, retaining only those hypotheses not ruled out. Assuming that the original hypothesis space does in fact contain the right answer, the reduced disjunction must still contain the right answer, because only incorrect hypotheses have been removed. The set of hypotheses remaining is called the version space, and the learning algorithm (sketched in Figure 19.3) is called the version space learning algorithm (also the candidate elimination algorithm).

function VERSION-SPACE-LEARNING(examples) returns a version space
  local variables: V, the version space: the set of all hypotheses
  V ← the set of all hypotheses
  for each example e in examples do
    if V is not empty then V ← VERSION-SPACE-UPDATE(V, e)
  return V

function VERSION-SPACE-UPDATE(V, e) returns an updated version space
  V ← {h ∈ V : h is consistent with e}

Figure 19.3 The version space learning algorithm. It finds a subset of V that is consistent with the examples.

One important property of this approach is that it is incremental: one never has to go back and reexamine the old examples. All remaining hypotheses are guaranteed to be consistent with them anyway. It is also a least-commitment algorithm because it makes no arbitrary choices (cf. the partial-order planning algorithm in Chapter 11). But there is an obvious problem. We already said that the hypothesis space is enormous, so how can we possibly write down this enormous disjunction?

The following simple analogy is very helpful. How do you represent all the real numbers between 1 and 2? After all, there are infinitely many of them. The answer is to use an interval representation that just specifies the boundaries of the set: [1, 2]. It works because we have an ordering on the real numbers. We also have an ordering on the hypothesis space, namely, generalization/specialization. This is a partial ordering, which means that each boundary will not be a point but rather a set of hypotheses called a boundary set. The great thing is that we can represent the entire version space using just two boundary sets: a most general boundary (the G-set) and a most specific boundary (the S-set). Everything in between is guaranteed to be consistent with the examples.

Figure 19.4 The version space contains all hypotheses consistent with the examples. (More general hypotheses lie toward the G-set boundary, more specific ones toward the S-set; the regions outside both boundaries are all inconsistent.)
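Before turning to boundary sets, note that Figure 19.3 can be rendered directly in Python whenever the hypothesis space is small enough to enumerate explicitly. The sketch below (ours, with an invented two-attribute hypothesis language) does exactly that; None plays the role of "don't care."

    from itertools import product

    # Enumerate a tiny hypothesis space: conjunctions over two attributes.
    def make_hypothesis(alt, pat):
        def h(desc):
            return ((alt is None or desc["Alternate"] == alt) and
                    (pat is None or desc["Patrons"] == pat))
        return h

    V = [make_hypothesis(alt, pat)
         for alt, pat in product([None, True, False], [None, "Some", "Full"])]

    def version_space_update(V, example):
        """Keep only the hypotheses consistent with the example."""
        desc, cls = example
        return [h for h in V if h(desc) == cls]

    examples = [({"Alternate": True, "Patrons": "Some"}, True),
                ({"Alternate": True, "Patrons": "Full"}, False)]

    for e in examples:
        V = version_space_update(V, e)
    print(len(V))  # 2 of the 9 candidate definitions survive these two examples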
Before we prove this, let us recap:

• The current version space is the set of hypotheses consistent with all the examples so far. It is represented by the S-set and G-set, each of which is a set of hypotheses.
• Every member of the S-set is consistent with all observations so far, and there are no consistent hypotheses that are more specific.
• Every member of the G-set is consistent with all observations so far, and there are no consistent hypotheses that are more general.

We want the initial version space (before any examples have been seen) to represent all possible hypotheses. We do this by setting the G-set to contain True (the hypothesis that contains everything), and the S-set to contain False (the hypothesis whose extension is empty). Figure 19.4 shows the general structure of the boundary-set representation of the version space.

To show that the representation is sufficient, we need the following two properties:

1. Every consistent hypothesis (other than those in the boundary sets) is more specific than some member of the G-set, and more general than some member of the S-set. (That is, there are no "stragglers" left outside.) This follows directly from the definitions of S and G. If there were a straggler h, then it would have to be no more specific than any member of G, in which case it belongs in G; or no more general than any member of S, in which case it belongs in S.

2. Every hypothesis more specific than some member of the G-set and more general than some member of the S-set is a consistent hypothesis. (That is, there are no "holes" between the boundaries.) Any h between S and G must reject all the negative examples rejected by each member of G (because it is more specific), and must accept all the positive examples accepted by any member of S (because it is more general). Thus, h must agree with all the examples, and therefore cannot be inconsistent. Figure 19.5 shows the situation: there are no known examples outside S but inside G, so any hypothesis in the gap must be consistent.

We have therefore shown that if S and G are maintained according to their definitions, then they provide a satisfactory representation of the version space. The only remaining problem is how to update S and G for a new example (the job of the VERSION-SPACE-UPDATE function). This may appear rather complicated at first, but from the definitions and with the help of Figure 19.4, it is not too hard to reconstruct the algorithm.

Figure 19.5 The extensions of the members of G and S. No known examples lie in between the two sets of boundaries.

We need to worry about the members Si and Gi of the S- and G-sets. For each one, the new instance may be a false positive or a false negative:

1. False positive for Si: This means Si is too general, but there are no consistent specializations of Si (by definition), so we throw it out of the S-set.
2. False negative for Si: This means Si is too specific, so we replace it by all its immediate generalizations, provided they are more specific than some member of G.
3. False positive for Gi: This means Gi is too general, so we replace it by all its immediate specializations, provided they are more general than some member of S.
4. False negative for Gi: This means Gi is too specific, but there are no consistent generalizations of Gi (by definition), so we throw it out of the G-set.

A sketch of these four update cases in code follows.
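The sketch below (ours, not the book's) implements the four cases for conjunctive hypotheses over discrete attributes, where a hypothesis is a set of attribute-value conditions and "more general" means "fewer conditions." Since the False hypothesis has no conjunctive representation, S is assumed to be seeded from the first positive example's full description, as is standard in attribute-based candidate elimination.

    def matches(h, desc):                      # h: frozenset of (attribute, value)
        return all(desc.get(a) == v for a, v in h)

    def more_general(h1, h2):                  # h1 is at least as general as h2
        return h1 <= h2                        # fewer conditions = more general

    def generalize(h):                         # immediate generalizations: drop one condition
        return [h - {c} for c in h]

    def specialize(h, conditions):             # immediate specializations: add one condition
        return [h | {c} for c in conditions if c not in h]

    def update(S, G, desc, cls, conditions):
        """One boundary-set update step for example (desc, cls)."""
        if cls:                                # positive example
            G = {g for g in G if matches(g, desc)}             # case 4: discard
            S = {s2 for s in S                                 # case 2: generalize
                 for s2 in ([s] if matches(s, desc) else generalize(s))
                 if matches(s2, desc) and any(more_general(g, s2) for g in G)}
        else:                                  # negative example
            S = {s for s in S if not matches(s, desc)}         # case 1: discard
            G = {g2 for g in G                                 # case 3: specialize
                 for g2 in ([g] if not matches(g, desc) else specialize(g, conditions))
                 if not matches(g2, desc) and any(more_general(g2, s) for s in S)}
        return S, G

For brevity, generalization here takes a single condition-dropping step; a full implementation would iterate until the new positive example is covered.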
We continue these operations for each new instance until one of three things happens:

1. We have exactly one concept left in the version space, in which case we return it as the unique hypothesis.
2. The version space collapses: either S or G becomes empty, indicating that there are no consistent hypotheses for the training set. This is the same case as the failure of the simple version of the decision tree algorithm.
3. We run out of examples with several hypotheses remaining in the version space. This means the version space represents a disjunction of hypotheses. For any new example, if all the disjuncts agree, then we can return their classification of the example. If they disagree, one possibility is to take the majority vote.

We leave as an exercise the application of the VERSION-SPACE-LEARNING algorithm to the restaurant data.

There are two principal drawbacks to the version-space approach:

• If the domain contains noise or insufficient attributes for exact classification, the version space will always collapse.
• If we allow unlimited disjunction in the hypothesis space, the S-set will always contain a single most-specific hypothesis, namely, the disjunction of the descriptions of the positive examples seen to date. Similarly, the G-set will contain just the negation of the disjunction of the descriptions of the negative examples. Moreover, for some hypothesis spaces, the number of elements in the S-set or G-set may grow exponentially in the number of attributes, even though efficient learning algorithms exist for those hypothesis spaces.

To date, no completely successful solution has been found for the problem of noise. The problem of disjunction can be addressed by allowing limited forms of disjunction or by including a generalization hierarchy of more general predicates. For example, instead of using the disjunction WaitEstimate(x, 30-60) ∨ WaitEstimate(x, >60), we might use the single literal LongWait(x). The set of generalization and specialization operations can be easily extended to handle this.

The pure version space algorithm was first applied in the Meta-DENDRAL system, which was designed to learn rules for predicting how molecules would break into pieces in a mass spectrometer (Buchanan and Mitchell, 1978). Meta-DENDRAL was able to generate rules that were sufficiently novel to warrant publication in a journal of analytical chemistry: the first real scientific knowledge generated by a computer program. It was also used in the elegant LEX system (Mitchell et al., 1983), which was able to learn to solve symbolic integration problems by studying its own successes and failures. Although version space methods are probably not practical in most real-world learning problems, mainly because of noise, they provide a good deal of insight into the logical structure of hypothesis space.

19.2 Knowledge in Learning

The preceding section described the simplest setting for inductive learning. To understand the role of prior knowledge, we need to talk about the logical relationships among hypotheses, example descriptions, and classifications. Let Descriptions denote the conjunction of all the example descriptions in the training set, and let Classifications denote the conjunction of all the example classifications. Then a Hypothesis that "explains the observations" must satisfy the following property (recall that ⊨ means "logically entails"):

    Hypothesis ∧ Descriptions ⊨ Classifications .        (19.3)

We call this kind of relationship an entailment constraint, in which Hypothesis is the "unknown." Pure inductive learning means solving this constraint, where Hypothesis is drawn from some predefined hypothesis space. For example, if we consider a decision tree as a logical formula (see Equation (19.1)), then a decision tree that is consistent with all the examples will satisfy Equation (19.3).
If we place no restrictions on the logical form of the hypothesis, of course, then Hypothesis = Classifications also satisfies the constraint. Ockham's razor tells us to prefer small, consistent hypotheses, so we try to do better than simply memorizing the examples.

This simple knowledge-free picture of inductive learning persisted until the early 1980s. The modern approach is to design agents that already know something and are trying to learn some more. This may not sound like a terrifically deep insight, but it makes quite a difference to the way we design agents. It might also have some relevance to our theories about how science itself works. The general idea is shown schematically in Figure 19.6.

Figure 19.6 A cumulative learning process uses, and adds to, its stock of background knowledge over time.

If we want to build an autonomous learning agent that uses background knowledge, the agent must have some method for obtaining the background knowledge in the first place, in order for it to be used in the new learning episodes. This method must itself be a learning process. The agent's life history will therefore be characterized by cumulative, or incremental, development. Presumably, the agent could start out with nothing, performing inductions in vacuo like a good little pure induction program. But once it has eaten from the Tree of Knowledge, it can no longer pursue such naive speculations and should use its background knowledge to learn more and more effectively. The question is then how to actually do this.

Some simple examples

Let us consider some commonsense examples of learning with background knowledge. Many apparently rational cases of inferential behavior in the face of observations clearly do not follow the simple principles of pure induction.

• Sometimes one leaps to general conclusions after only one observation. Gary Larson once drew a cartoon in which a bespectacled caveman, Zog, is roasting his lizard on the end of a pointed stick. He is watched by an amazed crowd of his less intellectual contemporaries, who have been using their bare hands to hold their victuals over the fire. This enlightening experience is enough to convince the watchers of a general principle of painless cooking.

• Or consider the case of the traveler to Brazil meeting her first Brazilian. On hearing him speak Portuguese, she immediately concludes that Brazilians speak Portuguese, yet on discovering that his name is Fernando, she does not conclude that all Brazilians are called Fernando.

• Similar examples appear in science. For example, when a freshman physics student measures the density and conductance of a sample of copper at a particular temperature, she is quite confident in generalizing those values to all pieces of copper. Yet when she measures its mass, she does not even consider the hypothesis that all pieces of copper have that mass. On the other hand, it would be quite reasonable to make such a generalization over all pennies.

• Finally, consider the case of a pharmacologically ignorant but diagnostically sophisticated medical student observing a consulting session between a patient and an expert internist. After a series of questions and answers, the expert tells the patient to take a course of a particular antibiotic. The medical student infers the general rule that that particular antibiotic is effective for a particular type of infection.
These are all cases in which the use of background knowledge allows much faster learning than one might expect from a pure induction program.

Some general schemes

In each of the preceding examples, one can appeal to prior knowledge to try to justify the generalizations chosen. We will now look at what kinds of entailment constraints are operating in each case. The constraints will involve the Background knowledge, in addition to the Hypothesis and the observed Descriptions and Classifications.

In the case of lizard toasting, the cavemen generalize by explaining the success of the pointed stick: it supports the lizard while keeping the hand away from the fire. From this explanation, they can infer a general rule: that any long, rigid, sharp object can be used to toast small, soft-bodied edibles. This kind of generalization process has been called explanation-based learning, or EBL. Notice that the general rule follows logically from the background knowledge possessed by the cavemen. Hence, the entailment constraints satisfied by EBL are the following:

    Hypothesis ∧ Descriptions ⊨ Classifications ,
    Background ⊨ Hypothesis .

Because EBL uses Equation (19.3), it was initially thought to be a better way to learn from examples. But because it requires that the background knowledge be sufficient to explain the Hypothesis, which in turn explains the observations, the agent does not actually learn anything factually new from the instance. The agent could have derived the example from what it already knew, although that might have required an unreasonable amount of computation. EBL is now viewed as a method for converting first-principles theories into useful, special-purpose knowledge. We describe algorithms for EBL in Section 19.3.

The situation of our traveler in Brazil is quite different, for she cannot necessarily explain why Fernando speaks the way he does, unless she knows her Papal bulls. Moreover, the same generalization would be forthcoming from a traveler entirely ignorant of colonial history. The relevant prior knowledge in this case is that, within any given country, most people tend to speak the same language; on the other hand, Fernando is not assumed to be the name of all Brazilians, because this kind of regularity does not hold for names. Similarly, the freshman physics student also would be hard put to explain the particular values that she discovers for the conductance and density of copper. She does know, however, that the material of which an object is composed and its temperature together determine its conductance. In each case, the prior knowledge Background concerns the relevance of a set of features to the goal predicate. This knowledge, together with the observations, allows the agent to infer a new, general rule that explains the observations:

    Hypothesis ∧ Descriptions ⊨ Classifications ,
    Background ∧ Descriptions ∧ Classifications ⊨ Hypothesis .        (19.4)

We call this kind of generalization relevance-based learning, or RBL (although the name is not standard). Notice that whereas RBL does make use of the content of the observations, it does not produce hypotheses that go beyond the logical content of the background knowledge and the observations. It is a deductive form of learning and cannot by itself account for the creation of new knowledge starting from scratch.
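For side-by-side comparison, the entailment constraints introduced so far can be collected in one display (our summary in LaTeX notation; the equation numbers refer to the text above):

    \begin{align*}
    &\text{Pure induction (19.3):} &
      \mathit{Hypothesis} \wedge \mathit{Descriptions} &\models \mathit{Classifications}\\
    &\text{EBL (in addition):} &
      \mathit{Background} &\models \mathit{Hypothesis}\\
    &\text{RBL (19.4, in addition):} &
      \mathit{Background} \wedge \mathit{Descriptions} \wedge \mathit{Classifications}
      &\models \mathit{Hypothesis}
    \end{align*}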
In the case of the medical student watching the expert, we assume that the student's prior knowledge is sufficient to infer the patient's disease D from the symptoms. This is not, however, enough to explain the fact that the doctor prescribes a particular medicine M. The student needs to propose another rule, namely, that M generally is effective against D. Given this rule and the student's prior knowledge, the student can now explain why the expert prescribes M in this particular case. We can generalize this example to come up with the entailment constraint

    Background ∧ Hypothesis ∧ Descriptions ⊨ Classifications .        (19.5)

That is, the background knowledge and the new hypothesis combine to explain the examples. As with pure inductive learning, the learning algorithm should propose hypotheses that are as simple as possible, consistent with this constraint. Algorithms that satisfy constraint (19.5) are called knowledge-based inductive learning, or KBIL, algorithms.

KBIL algorithms, which are described in detail in Section 19.5, have been studied mainly in the field of inductive logic programming, or ILP. In ILP systems, prior knowledge plays two key roles in reducing the complexity of learning:

1. Because any hypothesis generated must be consistent with the prior knowledge as well as with the new observations, the effective hypothesis space size is reduced to include only those theories that are consistent with what is already known.
2. For any given set of observations, the size of the hypothesis required to construct an explanation for the observations can be much reduced, because the prior knowledge will be available to help out the new rules in explaining the observations. The smaller the hypothesis, the easier it is to find.

In addition to allowing the use of prior knowledge in induction, ILP systems can formulate hypotheses in general first-order logic, rather than in the restricted attribute-based language of Chapter 18. This means that they can learn in environments that cannot be understood by simpler systems.

19.3 Explanation-Based Learning

As we explained in the introduction to this chapter, explanation-based learning is a method for extracting general rules from individual observations. As an example, consider the problem of differentiating and simplifying algebraic expressions (Exercise 9.15). If we differentiate an expression such as X² with respect to X, we obtain 2X. (Notice that we use a capital letter for the arithmetic unknown X, to distinguish it from the logical variable x.) In a logical reasoning system, the goal might be expressed as ASK(Derivative(X², X) = d, KB), with solution d = 2X. Anyone who knows differential calculus can see this solution "by inspection" as a result of practice in solving such problems. A student encountering such problems for the first time, or a program with no experience, will have a much more difficult job. Application of the standard rules of differentiation eventually yields the expression 1 × (2 × (X^(2-1))), and eventually this simplifies to 2X. In the authors' logic programming implementation, this takes 136 proof steps, of which 99 are on dead-end branches in the proof. After such an experience, we would like the program to solve the same problem much more quickly the next time it arises.
The technique of memoization has long been used in computer science to speed up programs by saving the results of computation. The basic idea of memo functions is to accumulate a database of input/output pairs; when the function is called, it first checks the database to see whether it can avoid solving the problem from scratch. Explanation-based learning takes this a good deal further, by creating general rules that cover an entire class of cases. In the case of differentiation, memoization would remember that the derivative of X² with respect to X is 2X, but would leave the agent to calculate the derivative of Z² with respect to Z from scratch. We would like to be able to extract the general rule that for any arithmetic unknown u, the derivative of u² with respect to u is 2u.² In logical terms, this is expressed by the rule

    ArithmeticUnknown(u) ⇒ Derivative(u², u) = 2u .

If the knowledge base contains such a rule, then any new case that is an instance of this rule can be solved immediately.

² Of course, a general rule for uⁿ can also be produced, but the current example suffices to make the point.
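As a minimal illustration of the difference (ours, not the book's), the sketch below memoizes the specific call Derivative(X², X) but applies a learned general rule to any term matching the pattern u², for any symbol u. Terms are nested tuples, an assumed encoding.

    # A term is a number, a symbol (string), or a tuple like ('pow', base, exponent).
    def derivative_from_scratch(term, var):
        """A deliberately tiny stand-in for the full rule-based differentiator."""
        if term == var:
            return 1
        if isinstance(term, tuple) and term[0] == 'pow' and term[1] == var:
            n = term[2]                         # power rule: d(u^n)/du = n*u^(n-1)
            return ('*', n, ('pow', var, n - 1))  # note: left unsimplified
        raise ValueError("term not handled in this sketch")

    memo = {}  # memoization: exact input -> previously computed output

    def derivative_memoized(term, var):
        if (term, var) not in memo:             # hit only on the identical problem
            memo[(term, var)] = derivative_from_scratch(term, var)
        return memo[(term, var)]

    def derivative_with_rule(term, var):
        """Learned EBL rule: for any arithmetic unknown u, d(u^2)/du = 2u."""
        if term == ('pow', var, 2) and isinstance(var, str):
            return ('*', 2, var)                # fires for X, Z, or any other symbol
        return derivative_from_scratch(term, var)

    print(derivative_with_rule(('pow', 'Z', 2), 'Z'))  # ('*', 2, 'Z'), in one step

Note that the memoized route still returns the unsimplified ('*', 2, ('pow', 'Z', 1)) on a first encounter with Z², mirroring the 1 × (2 × (X^(2-1))) expression in the text, whereas the learned rule yields the simplified answer directly.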
This is, of course, merely a trivial example of a very general phenomenon. Once something is understood, it can be generalized and reused in other circumstances. It becomes an "obvious" step and can then be used as a building block in solving problems still more complex. Alfred North Whitehead (1911), co-author with Bertrand Russell of Principia Mathematica, wrote "Civilization advances by extending the number of important operations that we can do without thinking about them," perhaps himself applying EBL to his understanding of events such as Zog's discovery. If you have understood the basic idea of the differentiation example, then your brain is already busily trying to extract the general principles of explanation-based learning from it. Notice that you hadn't already invented EBL before you saw the example. Like the cavemen watching Zog, you (and we) needed an example before we could generate the basic principles. This is because explaining why something is a good idea is much easier than coming up with the idea in the first place.

Extracting general rules from examples

The basic idea behind EBL is first to construct an explanation of the observation using prior knowledge, and then to establish a definition of the class of cases for which the same explanation structure can be used. This definition provides the basis for a rule covering all of the cases in the class. The "explanation" can be a logical proof, but more generally it can be any reasoning or problem-solving process whose steps are well defined. The key is to be able to identify the necessary conditions for those same steps to apply to another case.

We will use for our reasoning system the simple backward-chaining theorem prover described in Chapter 9. The proof tree for Derivative(X², X) = 2X is too large to use as an example, so we will use a simpler problem to illustrate the generalization method. Suppose our problem is to simplify 1 × (0 + X). The knowledge base includes the following rules:

    Rewrite(u, v) ∧ Simplify(v, w) ⇒ Simplify(u, w) .
    Primitive(u) ⇒ Simplify(u, u) .
    ArithmeticUnknown(u) ⇒ Primitive(u) .
    Number(u) ⇒ Primitive(u) .
    Rewrite(1 × u, u) .
    Rewrite(0 + u, u) .

The proof that the answer is X is shown in the top half of Figure 19.7. The EBL method actually constructs two proof trees simultaneously. The second proof tree uses a variabilized goal in which the constants from the original goal are replaced by variables. As the original proof proceeds, the variabilized proof proceeds in step, using exactly the same rule applications. This could cause some of the variables to become instantiated. For example, in order to use the rule Rewrite(1 × u, u), the variable x in the subgoal Rewrite(x × (y + z), v) must be bound to 1. Similarly, y must be bound to 0 in the subgoal Rewrite(y + z, v′) in order to use the rule Rewrite(0 + u, u). Once we have the generalized proof tree, we take the leaves (with the necessary bindings) and form a general rule for the goal predicate:

    Rewrite(1 × (0 + z), 0 + z) ∧ Rewrite(0 + z, z) ∧ ArithmeticUnknown(z) ⇒ Simplify(1 × (0 + z), z) .

Notice that the first two conditions on the left-hand side are true regardless of the value of z. We can therefore drop them from the rule, yielding

    ArithmeticUnknown(z) ⇒ Simplify(1 × (0 + z), z) .

Figure 19.7 Proof trees for the simplification problem. The first tree shows the proof for the original problem instance, from which we can derive ArithmeticUnknown(z) ⇒ Simplify(1 × (0 + z), z). The second shows the proof for a problem instance with all constants replaced by variables, from which we can derive a variety of other rules.

In general, conditions can be dropped from the final rule if they impose no constraints on the variables on the right-hand side of the rule, because the resulting rule will still be true and will be more efficient. Notice that we cannot drop the condition ArithmeticUnknown(z), because not all possible values of z are arithmetic unknowns. Values other than arithmetic unknowns might require different forms of simplification: for example, if z were 2 × 3, then the correct simplification of 1 × (0 + (2 × 3)) would be 6 and not 2 × 3.

To recap, the basic EBL process works as follows:

1. Given an example, construct a proof that the goal predicate applies to the example using the available background knowledge.
2. In parallel, construct a generalized proof tree for the variabilized goal using the same inference steps as in the original proof.
3. Construct a new rule whose left-hand side consists of the leaves of the proof tree and whose right-hand side is the variabilized goal (after applying the necessary bindings from the generalized proof).
4. Drop any conditions that are true regardless of the values of the variables in the goal.
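To see what the extracted rule buys us, here is an illustrative Python sketch (ours, with terms again encoded as nested tuples) in which the learned rule answers instances of its pattern in one pattern match, while other inputs fall back to chaining through the Rewrite rules:

    # Terms: numbers, symbols (strings, treated as arithmetic unknowns),
    # or tuples ('*', a, b) and ('+', a, b).
    def is_primitive(t):
        return isinstance(t, (int, str))

    def rewrite(t):
        """The two Rewrite rules: 1*u -> u and 0+u -> u."""
        if isinstance(t, tuple):
            if t[0] == '*' and t[1] == 1: return t[2]
            if t[0] == '+' and t[1] == 0: return t[2]
        return None

    def simplify(t):
        """Backward chaining over the knowledge base, one rule application at a time."""
        if is_primitive(t):
            return t
        u = rewrite(t)
        if u is not None:
            return simplify(u)
        raise ValueError("no rule applies in this sketch")

    def simplify_with_ebl(t):
        """Learned rule ArithmeticUnknown(z) => Simplify(1*(0+z), z):
        one pattern match replaces the whole chain of inference."""
        if (isinstance(t, tuple) and len(t) == 3 and t[0] == '*' and t[1] == 1
                and isinstance(t[2], tuple) and t[2][:2] == ('+', 0)
                and isinstance(t[2][2], str)):
            return t[2][2]                  # the learned rule fires immediately
        return simplify(t)

    print(simplify(('*', 1, ('+', 0, 'X'))))           # 'X' via three rule applications
    print(simplify_with_ebl(('*', 1, ('+', 0, 'Z'))))  # 'Z' in a single step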
Improving efficiency

The generalized proof tree in Figure 19.7 actually yields more than one generalized rule. For example, if we terminate, or prune, the growth of the right-hand branch in the proof tree when it reaches the Primitive step, we get the rule

    Primitive(z) ⇒ Simplify(1 × (0 + z), z) .

This rule is as valid as, but more general than, the rule using ArithmeticUnknown, because it also covers cases where z is a number. We can extract a still more general rule by pruning after the step Simplify(y + z, w), yielding the rule

    Simplify(y + z, w) ⇒ Simplify(1 × (y + z), w) .

In general, a rule can be extracted from any partial subtree of the generalized proof tree. Now we have a problem: which of these rules do we choose?

The choice of which rule to generate comes down to the question of efficiency. There are three factors involved in the analysis of efficiency gains from EBL:

1. Adding large numbers of rules can slow down the reasoning process, because the inference mechanism must still check those rules even in cases where they do not yield a solution. In other words, it increases the branching factor in the search space.
2. To compensate for the slowdown in reasoning, the derived rules must offer significant increases in speed for the cases that they do cover. These increases come about mainly because the derived rules avoid dead ends that would otherwise be taken, but also because they shorten the proof itself.
3. Derived rules should be as general as possible, so that they apply to the largest possible set of cases.

A common approach to ensuring that derived rules are efficient is to insist on the operationality of each subgoal in the rule. A subgoal is operational if it is "easy" to solve. For example, the subgoal Primitive(z) is easy to solve, requiring at most two steps, whereas the subgoal Simplify(y + z, w) could lead to an arbitrary amount of inference, depending on the values of y and z. If a test for operationality is carried out at each step in the construction of the generalized proof, then we can prune the rest of a branch as soon as an operational subgoal is found, keeping just the operational subgoal as a conjunct of the new rule.

Unfortunately, there is usually a tradeoff between operationality and generality. More specific subgoals are generally easier to solve but cover fewer cases. Also, operationality is a matter of degree: one or two steps is definitely operational, but what about 10 or 100? Finally, the cost of solving a given subgoal depends on what other rules are available in the knowledge base. It can go up or down as more rules are added. Thus, EBL systems really face a very complex optimization problem in trying to maximize the efficiency of a given initial knowledge base. It is sometimes possible to derive a mathematical model of the effect on overall efficiency of adding a given rule and to use this model to select the best rule to add. The analysis can become very complicated, however, especially when recursive rules are involved. One promising approach is to address the problem of efficiency empirically, simply by adding several rules and seeing which ones are useful and actually speed things up.

Empirical analysis of efficiency is actually at the heart of EBL. What we have been calling loosely the "efficiency of a given knowledge base" is actually the average-case complexity on a distribution of problems. By generalizing from past example problems, EBL makes the knowledge base more efficient for the kind of problems that it is reasonable to expect. This works as long as the distribution of past examples is roughly the same as for future examples, which is the same assumption used for PAC-learning in Section 18.5. If the EBL system is carefully engineered, it is possible to obtain significant speedups. For example, a very large Prolog-based natural language system designed for speech-to-speech translation between Swedish and English was able to achieve real-time performance only by the application of EBL to the parsing process (Samuelsson and Rayner, 1991).

19.4 Learning Using Relevance Information

Our traveler in Brazil seems to be able to make a confident generalization concerning the language spoken by other Brazilians. The inference is sanctioned by her background knowledge, namely, that people in a given country (usually) speak the same language.
We can express this in first-order logic as follows:³

    ∀x, y, n, l  Nationality(x, n) ∧ Nationality(y, n) ∧ Language(x, l) ⇒ Language(y, l) .        (19.6)

(Literal translation: "If x and y have the same nationality n and x speaks language l, then y also speaks it.") It is not difficult to show that, from this sentence and the observation that

    Nationality(Fernando, Brazil) ∧ Language(Fernando, Portuguese) ,

the following conclusion is entailed (see Exercise 19.1):

    ∀x Nationality(x, Brazil) ⇒ Language(x, Portuguese) .

Sentences such as (19.6) express a strict form of relevance: given nationality, language is fully determined. (Put another way: language is a function of nationality.) These sentences are called functional dependencies or determinations. They occur so commonly in certain kinds of applications (e.g., defining database designs) that a special syntax is used to write them. We adopt the notation of Davies (1985):

    Nationality(x, n) ≻ Language(x, l) .

As usual, this is simply a syntactic sugaring, but it makes it clear that the determination is really a relationship between the predicates: nationality determines language. The relevant properties determining conductance and density can be expressed similarly:

    Material(x, m) ∧ Temperature(x, t) ≻ Conductance(x, ρ) ;
    Material(x, m) ∧ Temperature(x, t) ≻ Density(x, d) .

The corresponding generalizations follow logically from the determinations and observations.

³ We assume for the sake of simplicity that a person speaks only one language. Clearly, the rule also would have to be amended for countries such as Switzerland and India.

Determining the hypothesis space

Although the determinations sanction general conclusions concerning all Brazilians, or all pieces of copper at a given temperature, they cannot, of course, yield a general predictive theory for all nationalities, or for all temperatures and materials, from a single example. Their main effect can be seen as limiting the space of hypotheses that the learning agent need consider. In predicting conductance, for example, one need consider only material and temperature and can ignore mass, ownership, day of the week, the current president, and so on. Hypotheses can certainly include terms that are in turn determined by material and temperature, such as molecular structure, thermal energy, or free-electron density. Determinations specify a sufficient basis vocabulary from which to construct hypotheses concerning the target predicate. This statement can be proven by showing that a given determination is logically equivalent to a statement that the correct definition of the target predicate is one of the set of all definitions expressible using the predicates on the left-hand side of the determination.

Intuitively, it is clear that a reduction in the hypothesis space size should make it easier to learn the target predicate. Using the basic results of computational learning theory (Section 18.5), we can quantify the possible gains. First, recall that for Boolean functions, log(|H|) examples are required to converge to a reasonable hypothesis, where |H| is the size of the hypothesis space. If the learner has n Boolean features with which to construct hypotheses, then, in the absence of further restrictions, |H| = O(2^(2^n)), so the number of examples is O(2^n). If the determination contains d predicates in the left-hand side, the learner will require only O(2^d) examples, a reduction of O(2^(n-d)).
For biased hypothesis spaces, such as a conjunctively biased space, the reduction will be less dramatic, but still significant.

Learning and using relevance information

As we stated in the introduction to this chapter, prior knowledge is useful in learning, but it too has to be learned. In order to provide a complete story of relevance-based learning, we must therefore provide a learning algorithm for determinations. The learning algorithm we now present is based on a straightforward attempt to find the simplest determination consistent with the observations. A determination P ≻ Q says that if any examples match on P, then they must also match on Q. A determination is therefore consistent with a set of examples if every pair that matches on the predicates on the left-hand side also matches on the target predicate, that is, has the same classification. For example, suppose we have the following examples of conductance measurements on material samples:

    Sample   Mass   Temperature   Material   Size   Conductance
    S1       12     26            Copper     3      0.59
    S1       12     100           Copper     3      0.57
    S2       24     26            Copper     6      0.59
    S3       12     26            Lead       2      0.05
    S3       12     100           Lead       2      0.04
    S4       24     26            Lead       4      0.05

function MINIMAL-CONSISTENT-DET(E, A) returns a set of attributes
  inputs: E, a set of examples
          A, a set of attributes, of size n
  for i = 0, ..., n do
    for each subset Ai of A of size i do
      if CONSISTENT-DET?(Ai, E) then return Ai

function CONSISTENT-DET?(A, E) returns a truth value
  inputs: A, a set of attributes
          E, a set of examples
  local variables: H, a hash table
  for each example e in E do
    if some example in H has the same values as e for the attributes A
       but a different classification then return false
    store the class of e in H, indexed by the values for attributes A of the example e
  return true

Figure 19.8 An algorithm for finding a minimal consistent determination.

The minimal consistent determination is Material ∧ Temperature ≻ Conductance. There is a nonminimal but consistent determination, namely, Mass ∧ Size ∧ Temperature ≻ Conductance. This is consistent with the examples because mass and size determine density and, in our data set, we do not have two different materials with the same density. As usual, we would need a larger sample set in order to eliminate a nearly correct hypothesis.

There are several possible algorithms for finding minimal consistent determinations. The most obvious approach is to conduct a search through the space of determinations, checking all determinations with one predicate, two predicates, and so on, until a consistent determination is found. We will assume a simple attribute-based representation, like that used for decision-tree learning in Chapter 18. A determination d will be represented by the set of attributes on the left-hand side, because the target predicate is assumed fixed. The basic algorithm is outlined in Figure 19.8; a Python rendering follows below.

The time complexity of this algorithm depends on the size of the smallest consistent determination. Suppose this determination has p attributes out of the n total attributes. Then the algorithm will not find it until searching the subsets of A of size p. There are (n choose p) = O(n^p) such subsets; hence the algorithm is exponential in the size of the minimal determination. It turns out that the problem is NP-complete, so we cannot expect to do better in the general case. In most domains, however, there will be sufficient local structure (see Chapter 14 for a definition of locally structured domains) that p will be small.
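Here is the promised Python rendering of Figure 19.8 (a sketch following the pseudocode; the example data mirror the table above):

    from itertools import combinations

    def consistent_det(attrs, examples):
        """CONSISTENT-DET?: no two examples agree on attrs but differ in class."""
        seen = {}
        for desc, cls in examples:
            key = tuple(desc[a] for a in attrs)
            if key in seen and seen[key] != cls:
                return False
            seen[key] = cls
        return True

    def minimal_consistent_det(examples, attributes):
        """MINIMAL-CONSISTENT-DET: smallest attribute set that determines the class."""
        for i in range(len(attributes) + 1):
            for subset in combinations(attributes, i):
                if consistent_det(subset, examples):
                    return set(subset)

    examples = [
        ({"Mass": 12, "Temp": 26,  "Material": "Copper", "Size": 3}, 0.59),
        ({"Mass": 12, "Temp": 100, "Material": "Copper", "Size": 3}, 0.57),
        ({"Mass": 24, "Temp": 26,  "Material": "Copper", "Size": 6}, 0.59),
        ({"Mass": 12, "Temp": 26,  "Material": "Lead",   "Size": 2}, 0.05),
        ({"Mass": 12, "Temp": 100, "Material": "Lead",   "Size": 2}, 0.04),
        ({"Mass": 24, "Temp": 26,  "Material": "Lead",   "Size": 4}, 0.05),
    ]
    attributes = ["Mass", "Temp", "Material", "Size"]
    print(minimal_consistent_det(examples, attributes))
    # {'Temp', 'Material'} (set order may vary): material and temperature
    # together determine conductance, as the text claims

The returned attribute set can then be handed to any attribute-based learner, which is exactly the combination described next.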
Given an algorithm for learning determinations, a learning agent has a way to construct a minimal hypothesis space within which to learn the target predicate. For example, we can combine MINIMAL-CONSISTENT-DET with the DECISION-TREE-LEARNING algorithm. This yields a relevance-based decision-tree learning algorithm RBDTL that first identifies a minimal set of relevant attributes and then passes this set to the decision tree algorithm for learning. Unlike DECISION-TREE-LEARNING, RBDTL simultaneously learns and uses relevance information in order to minimize its hypothesis space. We expect that RBDTL will learn faster than DECISION-TREE-LEARNING, and this is in fact the case. Figure 19.9 shows the learning performance for the two algorithms on randomly generated data for a function that depends on only 5 of 16 attributes. Obviously, in cases where all the available attributes are relevant, RBDTL will show no advantage.

Figure 19.9 A performance comparison between RBDTL and DECISION-TREE-LEARNING on randomly generated data for a target function that depends on only 5 of 16 attributes. (The original plot shows proportion correct, roughly 0.4 to 1, against training set size from 0 to 140, with curves for RBDTL and DTL.)

This section has only scratched the surface of the field of declarative bias, which aims to understand how prior knowledge can be used to identify the appropriate hypothesis space within which to search for the correct target definition. There are many unanswered questions:

• How can the algorithms be extended to handle noise?
• Can we handle continuous-valued variables?
• How can other kinds of prior knowledge be used, besides determinations?
• How can the algorithms be generalized to cover any first-order theory, rather than just an attribute-based representation?

Some of these questions are addressed in the next section.

19.5 Inductive Logic Programming

Inductive logic programming (ILP) combines inductive methods with the power of first-order representations, concentrating in particular on the representation of theories as logic programs.⁴ It has gained popularity for three reasons. First, ILP offers a rigorous approach to

⁴ It might be appropriate at this point for the reader to refer to Chapter 9 for some of the underlying concepts, including Horn clauses, conjunctive normal form, unification, and resolution.
