Uncertainty in Artificial Intelligence Notes

In which we see what an agent should do when not all is crystal clear.

13.1 ACTING UNDER UNCERTAINTY

The logical agents described in Parts III and IV make the epistemological commitment that propositions are true, false, or unknown. When an agent knows enough facts about its environment, the logical approach enables it to derive plans that are guaranteed to work. This is a good thing. Unfortunately, agents almost never have access to the whole truth about their environment. Agents must, therefore, act under uncertainty. For example, an agent in the wumpus world of Chapter 7 has sensors that report only local information; most of the world is not immediately observable. A wumpus agent often will find itself unable to discover which of two squares contains a pit. If those squares are en route to the gold, then the agent might have to take a chance and enter one of the two squares.

The real world is far more complex than the wumpus world. For a logical agent, it might be impossible to construct a complete and correct description of how its actions will work. Suppose, for example, that the agent wants to drive someone to the airport to catch a flight and is considering a plan, A90, that involves leaving home 90 minutes before the flight departs and driving at a reasonable speed. Even though the airport is only about 15 miles away, the agent will not be able to conclude with certainty that "Plan A90 will get us to the airport in time." Instead, it reaches the weaker conclusion "Plan A90 will get us to the airport in time, as long as my car doesn't break down or run out of gas, and I don't get into an accident, and there are no accidents on the bridge, and the plane doesn't leave early, and . . . ." None of these conditions can be deduced, so the plan's success cannot be inferred. This is an example of the qualification problem mentioned in Chapter 10.

If a logical agent cannot conclude that any particular course of action achieves its goal, then it will be unable to act. Conditional planning can overcome uncertainty to some extent, but only if the agent's sensing actions can obtain the required information and only if there are not too many different contingencies. Another possible solution would be to endow the agent with a simple but incorrect theory of the world that does enable it to derive a plan; presumably, such plans will work most of the time, but problems arise when events contradict the agent's theory. Moreover, handling the tradeoff between the accuracy and usefulness of the agent's theory seems itself to require reasoning about uncertainty. In sum, no purely logical agent will be able to conclude that plan A90 is the right thing to do.

Nonetheless, let us suppose that A90 is in fact the right thing to do. What do we mean by saying this? As we discussed in Chapter 2, we mean that out of all the plans that could be executed, A90 is expected to maximize the agent's performance measure, given the information it has about the environment. The performance measure includes getting to the airport in time for the flight, avoiding a long, unproductive wait at the airport, and avoiding speeding tickets along the way. The information the agent has cannot guarantee any of these outcomes for A90, but it can provide some degree of belief that they will be achieved. Other plans, such as A120, might increase the agent's belief that it will get to the airport on time, but also increase the likelihood of a long wait.
The right thing to do (the rational decision) therefore depends on both the relative importance of various goals and the likelihood that, and degree to which, they will be achieved. The remainder of this section hones these ideas, in preparation for the development of the general theories of uncertain reasoning and rational decisions that we present in this and subsequent chapters.

Handling uncertain knowledge

In this section, we look more closely at the nature of uncertain knowledge. We will use a simple diagnosis example to illustrate the concepts involved. Diagnosis, whether for medicine, automobile repair, or whatever, is a task that almost always involves uncertainty. Let us try to write rules for dental diagnosis using first-order logic, so that we can see how the logical approach breaks down. Consider the following rule:

    ∀p Symptom(p, Toothache) ⇒ Disease(p, Cavity) .

The problem is that this rule is wrong. Not all patients with toothaches have cavities; some of them have gum disease, an abscess, or one of several other problems:

    ∀p Symptom(p, Toothache) ⇒ Disease(p, Cavity) ∨ Disease(p, GumDisease) ∨ Disease(p, Abscess) ∨ . . .

Unfortunately, in order to make the rule true, we have to add an almost unlimited list of possible causes. We could try turning the rule into a causal rule:

    ∀p Disease(p, Cavity) ⇒ Symptom(p, Toothache) .

But this rule is not right either; not all cavities cause pain. The only way to fix the rule is to make it logically exhaustive: to augment the left-hand side with all the qualifications required for a cavity to cause a toothache. Even then, for the purposes of diagnosis, one must also take into account the possibility that the patient might have a toothache and a cavity that are unconnected. Trying to use first-order logic to cope with a domain like medical diagnosis thus fails for three main reasons:

- Laziness: It is too much work to list the complete set of antecedents or consequents needed to ensure an exceptionless rule, and too hard to use such rules.
- Theoretical ignorance: Medical science has no complete theory for the domain.
- Practical ignorance: Even if we know all the rules, we might be uncertain about a particular patient because not all the necessary tests have been or can be run.

The connection between toothaches and cavities is just not a logical consequence in either direction. This is typical of the medical domain, as well as most other judgmental domains: law, business, design, automobile repair, gardening, dating, and so on. The agent's knowledge can at best provide only a degree of belief in the relevant sentences. Our main tool for dealing with degrees of belief will be probability theory, which assigns to each sentence a numerical degree of belief between 0 and 1. (Some alternative methods for uncertain reasoning are covered in Section 14.7.)

Probability provides a way of summarizing the uncertainty that comes from our laziness and ignorance. We might not know for sure what afflicts a particular patient, but we believe that there is, say, an 80% chance (that is, a probability of 0.8) that the patient has a cavity if he or she has a toothache. That is, we expect that out of all the situations that are indistinguishable from the current situation as far as the agent's knowledge goes, the patient will have a cavity in 80% of them.
This belief could be derived from statistical data (80% of the toothache patients seen so far have had cavities), or from some general rules, or from a combination of evidence sources. The 80% summarizes those cases in which all the factors needed for a cavity to cause a toothache are present and other cases in which the patient has both toothache and cavity but the two are unconnected. The missing 20% summarizes all the other possible causes of toothache that we are too lazy or ignorant to confirm or deny.

Assigning a probability of 0 to a given sentence corresponds to an unequivocal belief that the sentence is false, while assigning a probability of 1 corresponds to an unequivocal belief that the sentence is true. Probabilities between 0 and 1 correspond to intermediate degrees of belief in the truth of the sentence. The sentence itself is in fact either true or false. It is important to note that a degree of belief is different from a degree of truth. A probability of 0.8 does not mean "80% true" but rather an 80% degree of belief, that is, a fairly strong expectation. Thus, probability theory makes the same ontological commitment as logic, namely, that facts either do or do not hold in the world. Degree of truth, as opposed to degree of belief, is the subject of fuzzy logic, which is covered in Section 14.7.

In logic, a sentence such as "The patient has a cavity" is true or false depending on the interpretation and the world; it is true just when the fact it refers to is the case. In probability theory, a sentence such as "The probability that the patient has a cavity is 0.8" is about the agent's beliefs, not directly about the world. These beliefs depend on the percepts that the agent has received to date. These percepts constitute the evidence on which probability assertions are based. For example, suppose that the agent has drawn a card from a shuffled pack. Before looking at the card, the agent might assign a probability of 1/52 to its being the ace of spades. After looking at the card, an appropriate probability for the same proposition would be 0 or 1. Thus, an assignment of probability to a proposition is analogous to saying whether a given logical sentence (or its negation) is entailed by the knowledge base, rather than whether or not it is true. Just as entailment status can change when more sentences are added to the knowledge base, probabilities can change when more evidence is acquired. (This is quite different from a sentence's becoming true or false as the world changes. Handling a changing world via probabilities requires the same kinds of mechanisms, such as situations, intervals, and events, that we used in Chapter 10 for logical representations. These mechanisms are discussed in Chapter 15.)

All probability statements must therefore indicate the evidence with respect to which the probability is being assessed. As the agent receives new percepts, its probability assessments are updated to reflect the new evidence. Before the evidence is obtained, we talk about prior or unconditional probability; after the evidence is obtained, we talk about posterior or conditional probability. In most cases, an agent will have some evidence from its percepts and will be interested in computing the posterior probabilities of the outcomes it cares about.

Uncertainty and rational decisions

The presence of uncertainty radically changes the way an agent makes decisions. A logical agent typically has a goal and executes any plan that is guaranteed to achieve it. An action can be selected or rejected on the basis of whether it achieves the goal, regardless of what other actions might achieve. When uncertainty enters the picture, this is no longer the case. Consider again the A90 plan for getting to the airport.
Suppose it has a 95% chance of succeeding. Does this mean it is a rational choice? Not necessarily: there might be other plans, such as A120, with higher probabilities of success. If it is vital not to miss the flight, then it is worth risking the longer wait at the airport. What about A1440, a plan that involves leaving home 24 hours in advance? In most circumstances, this is not a good choice, because, although it almost guarantees getting there on time, it involves an intolerable wait.

To make such choices, an agent must first have preferences between the different possible outcomes of the various plans. A particular outcome is a completely specified state, including such factors as whether the agent arrives on time and the length of the wait at the airport. We will be using utility theory to represent and reason with preferences. (The term utility is used here in the sense of "the quality of being useful," not in the sense of the electric company or water works.) Utility theory says that every state has a degree of usefulness, or utility, to an agent and that the agent will prefer states with higher utility.

The utility of a state is relative to the agent whose preferences the utility function is supposed to represent. For example, the payoff functions for games in Chapter 6 are utility functions. The utility of a state in which White has won a game of chess is obviously high for the agent playing White, but low for the agent playing Black. Or again, some players (including the authors) might be happy with a draw against the world champion, whereas other players (including the former world champion) might not. There is no accounting for taste or preferences: you might think that an agent who prefers jalapeño bubble-gum ice cream to chocolate chocolate chip is odd or even misguided, but you could not say the agent is irrational. A utility function can even account for altruistic behavior, simply by including the welfare of others as one of the factors contributing to the agent's own utility.

Preferences, as expressed by utilities, are combined with probabilities in the general theory of rational decisions called decision theory:

    Decision theory = probability theory + utility theory .

The fundamental idea of decision theory is that an agent is rational if and only if it chooses the action that yields the highest expected utility, averaged over all the possible outcomes of the action. This is called the principle of Maximum Expected Utility (MEU). We saw this principle in action in Chapter 6 when we touched briefly on optimal decisions in backgammon. We will see that it is in fact a completely general principle.

Design for a decision-theoretic agent

Figure 13.1 sketches the structure of an agent that uses decision theory to select actions. The agent is identical, at an abstract level, to the logical agent described in Chapter 7. The primary difference is that the decision-theoretic agent's knowledge of the current state is uncertain; the agent's belief state is a representation of the probabilities of all possible actual states of the world.
As time passes, the agent accumulates more evidence and its belief state changes. Given the belief state, the agent can make probabilistic predictions of action outcomes and hence select the action with highest expected utility. This chapter and the next concentrate on the task of representing and computing with probabilistic information in general. Chapter 15 deals with methods for the specific tasks of representing and updating the belief state and predicting the environment. Chapter 16 covers utility theory in more depth, and Chapter 17 develops algorithms for making complex decisions.

    function DT-AGENT(percept) returns an action
      static: belief_state, probabilistic beliefs about the current state of the world
              action, the agent's action

      update belief_state based on action and percept
      calculate outcome probabilities for actions,
        given action descriptions and current belief_state
      select action with highest expected utility
        given probabilities of outcomes and utility information
      return action

    Figure 13.1: A decision-theoretic agent that selects rational actions. The steps will be fleshed out in the next five chapters.

13.2 BASIC PROBABILITY NOTATION

Now that we have set up the general framework for a rational agent, we will need a formal language for representing and reasoning with uncertain knowledge. Any notation for describing degrees of belief must be able to deal with two main issues: the nature of the sentences to which degrees of belief are assigned and the dependence of the degree of belief on the agent's experience. The version of probability theory we present uses an extension of propositional logic for its sentences. The dependence on experience is reflected in the syntactic distinction between prior probability statements, which apply before any evidence is obtained, and conditional probability statements, which include the evidence explicitly.

Propositions

Degrees of belief are always applied to propositions, assertions that such-and-such is the case. So far we have seen two formal languages, propositional logic and first-order logic, for stating propositions. Probability theory typically uses a language that is slightly more expressive than propositional logic. This section describes that language. (Section 14.6 discusses ways to ascribe degrees of belief to assertions in first-order logic.)

The basic element of the language is the random variable, which can be thought of as referring to a "part" of the world whose "status" is initially unknown. For example, Cavity might refer to whether my lower left wisdom tooth has a cavity. Random variables play a role similar to that of CSP variables in constraint satisfaction problems and that of proposition symbols in propositional logic. We will always capitalize the names of random variables. (However, we still use lowercase, single-letter names to represent an unknown random variable, for example: P(a) = 1 − P(¬a).)

Each random variable has a domain of values that it can take on. For example, the domain of Cavity might be ⟨true, false⟩. (We will use lowercase for the names of values.) The simplest kind of proposition asserts that a random variable has a particular value drawn from its domain. For example, Cavity = true might represent the proposition that I do in fact have a cavity in my lower left wisdom tooth.
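To make this concrete, here is a minimal Python sketch (mine, not from the text) of random variables as named domains, with a proposition "Var = value" checked against its variable's domain:

    # A minimal sketch: the dictionary layout and function name are my own.
    DOMAINS = {
        "Cavity": (True, False),                          # Boolean random variable
        "Weather": ("sunny", "rainy", "cloudy", "snow"),  # discrete random variable
    }

    def is_valid_proposition(var, value):
        # A proposition asserts that a random variable takes a value from its domain.
        return value in DOMAINS[var]

    print(is_valid_proposition("Cavity", True))     # True
    print(is_valid_proposition("Weather", "hail"))  # False: not in Weather's domain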
As with CSP variables, random variables are typically divided into three kinds, depending on the type of the domain:

- Boolean random variables, such as Cavity, have the domain ⟨true, false⟩. We will often abbreviate a proposition such as Cavity = true simply by the lowercase name cavity. Similarly, Cavity = false would be abbreviated by ¬cavity.
- Discrete random variables, which include Boolean random variables as a special case, take on values from a countable domain. For example, the domain of Weather might be ⟨sunny, rainy, cloudy, snow⟩. The values in the domain must be mutually exclusive and exhaustive. Where no confusion arises, we will use, for example, snow as an abbreviation for Weather = snow.
- Continuous random variables take on values from the real numbers. The domain can be either the entire real line or some subset such as the interval [0, 1]. For example, the proposition X = 4.02 asserts that the random variable X has the exact value 4.02. Propositions concerning continuous random variables can also be inequalities, such as X ≤ 4.02.

With some exceptions, we will be concentrating on the discrete case. (One might expect the domain of a Boolean variable to be written as a set, {true, false}. We write it as a tuple because it will be convenient later to impose an ordering on the values.)

Elementary propositions, such as Cavity = true and Toothache = false, can be combined to form complex propositions using all the standard logical connectives. For example, Cavity = true ∧ Toothache = false is a proposition to which one may ascribe a degree of (dis)belief. As explained in the previous paragraph, this proposition may also be written as cavity ∧ ¬toothache.

Atomic events

The notion of an atomic event is useful in understanding the foundations of probability theory. An atomic event is a complete specification of the state of the world about which the agent is uncertain. It can be thought of as an assignment of particular values to all the variables of which the world is composed. For example, if my world consists of only the Boolean variables Cavity and Toothache, then there are just four distinct atomic events; the proposition Cavity = false ∧ Toothache = true is one such event. (Many standard formulations of probability theory take atomic events, also known as sample points, as primitive and define a random variable as a function taking an atomic event as input and returning a value from the appropriate domain. Such an approach is perhaps more general, but also less intuitive.)

Atomic events have some important properties:

- They are mutually exclusive: at most one can actually be the case. For example, cavity ∧ toothache and cavity ∧ ¬toothache cannot both be the case.
- The set of all possible atomic events is exhaustive: at least one must be the case. That is, the disjunction of all atomic events is logically equivalent to true.
- Any particular atomic event entails the truth or falsehood of every proposition, whether simple or complex. This can be seen by using the standard semantics for logical connectives (Chapter 7). For example, the atomic event cavity ∧ ¬toothache entails the truth of cavity and the falsehood of cavity ⇒ toothache.
- Any proposition is logically equivalent to the disjunction of all atomic events that entail the truth of the proposition. For example, the proposition cavity is equivalent to the disjunction of the atomic events cavity ∧ toothache and cavity ∧ ¬toothache.

Exercise 13.4 asks you to prove some of these properties.
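A small Python sketch (the dict-of-assignments representation is my own) that enumerates the four atomic events of the Cavity/Toothache world and picks out the ones entailing cavity:

    # A minimal sketch of atomic events for a two-variable Boolean world.
    from itertools import product

    variables = ["Cavity", "Toothache"]
    atomic_events = [dict(zip(variables, values))
                     for values in product([True, False], repeat=len(variables))]
    print(len(atomic_events))   # 4 atomic events; mutually exclusive and exhaustive

    # The proposition cavity is the disjunction of the atomic events entailing it:
    # cavity ∧ toothache and cavity ∧ ¬toothache.
    print([e for e in atomic_events if e["Cavity"]])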
Prior probability

The unconditional or prior probability associated with a proposition a is the degree of belief accorded to it in the absence of any other information; it is written as P(a). For example, if the prior probability that I have a cavity is 0.1, then we would write

    P(Cavity = true) = 0.1    or    P(cavity) = 0.1 .

It is important to remember that P(a) can be used only when there is no other information. As soon as some new information is known, we must reason with the conditional probability of a given that new information. Conditional probabilities are covered in the next section.

Sometimes, we will want to talk about the probabilities of all the possible values of a random variable. In that case, we will use an expression such as P(Weather), which denotes a vector of values for the probabilities of each individual state of the weather. Thus, instead of writing the four equations

    P(Weather = sunny) = 0.7
    P(Weather = rain) = 0.2
    P(Weather = cloudy) = 0.08
    P(Weather = snow) = 0.02 ,

we may simply write

    P(Weather) = ⟨0.7, 0.2, 0.08, 0.02⟩ .

This statement defines a prior probability distribution for the random variable Weather.

We will also use expressions such as P(Weather, Cavity) to denote the probabilities of all combinations of the values of a set of random variables. In that case, P(Weather, Cavity) can be represented by a 4 × 2 table of probabilities. This is called the joint probability distribution of Weather and Cavity.

Sometimes it will be useful to think about the complete set of random variables used to describe the world. A joint probability distribution that covers this complete set is called the full joint probability distribution. For example, if the world consists of just the variables Cavity, Toothache, and Weather, then the full joint distribution is given by P(Cavity, Toothache, Weather). This joint distribution can be represented as a 2 × 2 × 4 table with 16 entries. A full joint distribution specifies the probability of every atomic event and is therefore a complete specification of one's uncertainty about the world in question. We will see in Section 13.4 that any probabilistic query can be answered from the full joint distribution.
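As a sketch in Python (my own layout), a prior distribution and a joint distribution are just tables indexed by values. The joint entries below are illustrative placeholders computed under an independence assumption that Section 13.5 discusses; they are not from the text:

    # A minimal sketch: P(Weather) as a table and a 4 x 2 joint P(Weather, Cavity).
    P_weather = {"sunny": 0.7, "rainy": 0.2, "cloudy": 0.08, "snow": 0.02}
    assert abs(sum(P_weather.values()) - 1.0) < 1e-9   # a distribution sums to 1

    P_cavity = {True: 0.2, False: 0.8}
    # Illustrative joint, assuming independence purely for demonstration.
    P_weather_cavity = {(w, c): P_weather[w] * P_cavity[c]
                        for w in P_weather for c in P_cavity}
    print(len(P_weather_cavity))   # 8 entries, one per combination of values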
For continuous variables, it is not possible to write out the entire distribution as a table, because there are infinitely many values. Instead, one usually defines the probability that a random variable takes on some value x as a parameterized function of x. For example, let the random variable X denote tomorrow's maximum temperature in Berkeley. Then the sentence

    P(X = x) = U[18,26](x)

expresses the belief that X is distributed uniformly between 18 and 26 degrees Celsius. (Several useful continuous distributions are defined in Appendix A.) Probability distributions for continuous variables are called probability density functions. Density functions differ in meaning from discrete distributions. For example, using the temperature distribution given earlier, we find that P(X = 20.5) = U[18,26](20.5) = 0.125/C. This does not mean that there's a 12.5% chance that the maximum temperature will be exactly 20.5 degrees tomorrow; the probability that this will happen is of course zero. The technical meaning is that the probability that the temperature is in a small region around 20.5 degrees is equal, in the limit, to 0.125 divided by the width of the region in degrees Celsius:

    lim(dx→0) P(20.5 ≤ X ≤ 20.5 + dx)/dx = 0.125/C .

(The general notational rule is that the distribution covers all values of the variables that are capitalized. Thus, P(Weather, cavity) is a four-element vector of probabilities for the conjunction of each weather type with Cavity = true.) Some authors use different symbols for discrete distributions and density functions; we use P in both cases, since confusion seldom arises and the equations are usually identical. Note that probabilities are unitless numbers, whereas density functions are measured with a unit, in this case reciprocal degrees.

Conditional probability

Once the agent has obtained some evidence concerning the previously unknown random variables making up the domain, prior probabilities are no longer applicable. Instead, we use conditional or posterior probabilities. The notation used is P(a|b), where a and b are any propositions. This is read as "the probability of a, given that all we know is b." For example,

    P(cavity|toothache) = 0.8

indicates that if a patient is observed to have a toothache and no other information is yet available, then the probability of the patient's having a cavity will be 0.8. A prior probability, such as P(cavity), can be thought of as a special case of the conditional probability P(cavity| ), where the probability is conditioned on no evidence.

Conditional probabilities can be defined in terms of unconditional probabilities. The defining equation is

    P(a|b) = P(a ∧ b) / P(b) ,    (13.1)

which holds whenever P(b) > 0. This equation can also be written as

    P(a ∧ b) = P(a|b)P(b) ,

which is called the product rule. The product rule is perhaps easier to remember: it comes from the fact that, for a and b to be true, we need b to be true, and we also need a to be true given b. We can also have it the other way around:

    P(a ∧ b) = P(b|a)P(a) .

In some cases, it is easier to reason in terms of prior probabilities of conjunctions, but for the most part, we will use conditional probabilities as our vehicle for probabilistic inference.

We can also use the P notation for conditional distributions. P(X|Y) gives the values of P(X = xi | Y = yj) for each possible i, j. As an example of how this makes our notation more concise, consider applying the product rule to each case where the propositions a and b assert particular values of X and Y respectively. We obtain the following equations:

    P(X = x1 ∧ Y = y1) = P(X = x1 | Y = y1) P(Y = y1)
    P(X = x1 ∧ Y = y2) = P(X = x1 | Y = y2) P(Y = y2)
    . . .

We can combine all these into the single equation

    P(X, Y) = P(X|Y)P(Y) .

Remember that this denotes a set of equations relating the corresponding individual entries in the tables, not a matrix multiplication of the tables. (The "|" operator has the lowest possible precedence, so P(a ∧ b|c ∨ d) means P((a ∧ b)|(c ∨ d)).)
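A quick Python sketch of Equation (13.1); the joint table is made up for illustration, chosen so that P(cavity|toothache) comes out to the 0.8 used earlier in the text:

    # A minimal sketch: P(a|b) = P(a ∧ b) / P(b) from a small joint table.
    P = {(True, True): 0.16, (True, False): 0.04,     # (cavity, toothache) -> prob
         (False, True): 0.04, (False, False): 0.76}   # illustrative numbers only

    def prob(pred):
        # Sum the probabilities of the atomic events in which the proposition holds.
        return sum(p for event, p in P.items() if pred(event))

    p_b = prob(lambda e: e[1])                   # P(toothache) = 0.2
    p_a_and_b = prob(lambda e: e[0] and e[1])    # P(cavity ∧ toothache) = 0.16
    print(p_a_and_b / p_b)                       # P(cavity|toothache) = 0.8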
It is tempting, but wrong, to view conditional probabilities as if they were logical implications with uncertainty added. For example, the sentence P(a|b) = 0.8 cannot be interpreted to mean "whenever b holds, conclude that P(a) is 0.8." Such an interpretation would be wrong on two counts: first, P(a) always denotes the prior probability of a, not the posterior probability given some evidence; second, the statement P(a|b) = 0.8 is immediately relevant just when b is the only available evidence. When additional information c is available, the degree of belief in a is P(a|b ∧ c), which might have little relation to P(a|b). For example, c might tell us directly whether a is true or false. If we examine a patient who complains of toothache, and discover a cavity, then we have additional evidence cavity, and we conclude (trivially) that P(cavity|toothache ∧ cavity) = 1.0.

13.3 THE AXIOMS OF PROBABILITY

So far, we have defined a syntax for propositions and for prior and conditional probability statements about those propositions. Now we must provide some sort of semantics for probability statements. We begin with the basic axioms that serve to define the probability scale and its endpoints:

1. All probabilities are between 0 and 1. For any proposition a,

       0 ≤ P(a) ≤ 1 .

2. Necessarily true (i.e., valid) propositions have probability 1, and necessarily false (i.e., unsatisfiable) propositions have probability 0:

       P(true) = 1 ,    P(false) = 0 .

Next, we need an axiom that connects the probabilities of logically related propositions. The simplest way to do this is to define the probability of a disjunction as follows:

3. The probability of a disjunction is given by

       P(a ∨ b) = P(a) + P(b) − P(a ∧ b) .

This rule is easily remembered by noting that the cases where a holds, together with the cases where b holds, certainly cover all the cases where a ∨ b holds; but summing the two sets of cases counts their intersection twice, so we need to subtract P(a ∧ b).

These three axioms are often called Kolmogorov's axioms in honor of the Russian mathematician Andrei Kolmogorov, who showed how to build up the rest of probability theory from this simple foundation. Notice that the axioms deal only with prior probabilities rather than conditional probabilities; this is because we have already defined the latter in terms of the former via Equation (13.1).

WHERE DO PROBABILITIES COME FROM?

There has been endless debate over the source and status of probability numbers. The frequentist position is that the numbers can come only from experiments: if we test 100 people and find that 10 of them have a cavity, then we can say that the probability of a cavity is approximately 0.1. In this view, the assertion "the probability of a cavity is 0.1" means that 0.1 is the fraction that would be observed in the limit of infinitely many samples. From any finite sample, we can estimate the true fraction and also calculate how accurate our estimate is likely to be.

The objectivist view is that probabilities are real aspects of the universe (propensities of objects to behave in certain ways) rather than being just descriptions of an observer's degree of belief. For example, that a fair coin comes up heads with probability 0.5 is a propensity of the coin itself. In this view, frequentist measurements are attempts to observe these propensities. Most physicists agree that quantum phenomena are objectively probabilistic, but uncertainty at the macroscopic scale (e.g., in coin tossing) usually arises from ignorance of initial conditions and does not seem consistent with the propensity view.

The subjectivist view describes probabilities as a way of characterizing an agent's beliefs, rather than as having any external physical significance.
This allows the doctor or analyst to make the numbers up, to say, "In my opinion, I expect the probability of a cavity to be about 0.1." Several more reliable techniques, such as the betting systems described on page 474, have also been developed for eliciting probability assessments from humans.

In the end, even a strict frequentist position involves subjective analysis, so the difference probably has little practical importance. The reference class problem illustrates the intrusion of subjectivity. Suppose that a frequentist doctor wants to know the chances that a patient has a particular disease. The doctor wants to consider other patients who are similar in important ways (age, symptoms, perhaps sex) and see what proportion of them had the disease. But if the doctor considered everything that is known about the patient (weight to the nearest gram, hair color, mother's maiden name, etc.), the result would be that there are no other patients who are exactly the same and thus no reference class from which to collect experimental data. This has been a vexing problem in the philosophy of science.

Laplace's principle of indifference (1816) states that propositions that are syntactically "symmetric" with respect to the evidence should be accorded equal probability. Various refinements have been proposed, culminating in the attempt by Carnap and others to develop a rigorous inductive logic, capable of computing the correct probability for any proposition from any collection of observations. Currently, it is believed that no unique inductive logic exists; rather, any such logic rests on a subjective prior probability distribution whose effect is diminished as more observations are collected.

Using the axioms of probability

We can derive a variety of useful facts from the basic axioms. For example, the familiar rule for negation follows by substituting ¬a for b in axiom 3, giving us:

    P(a ∨ ¬a) = P(a) + P(¬a) − P(a ∧ ¬a)    (by axiom 3 with b = ¬a)
    P(true)   = P(a) + P(¬a) − P(false)     (by logical equivalence)
    1         = P(a) + P(¬a)                (by axiom 2)
    P(¬a)     = 1 − P(a)                    (by algebra).

The third line of this derivation is itself a useful fact and can be extended from the Boolean case to the general discrete case. Let the discrete variable D have the domain ⟨d1, . . . , dn⟩. Then it is easy to show (Exercise 13.2) that

    Σ(i=1..n) P(D = di) = 1 .

That is, any probability distribution on a single variable must sum to 1. (For continuous variables, the summation is replaced by an integral: ∫ P(X = x) dx = 1.) It is also true that any joint probability distribution on any set of variables must sum to 1: this can be seen simply by creating a single megavariable whose domain is the cross product of the domains of the original variables.

Recall that any proposition a is equivalent to the disjunction of all the atomic events in which a holds; call this set of events e(a). Recall also that atomic events are mutually exclusive, so the probability of any conjunction of distinct atomic events is zero, by axiom 2. Hence, from axiom 3, we can derive the following simple relationship: the probability of a proposition is equal to the sum of the probabilities of the atomic events in which it holds; that is,

    P(a) = Σ(ei ∈ e(a)) P(ei) .    (13.2)

This equation provides a simple method for computing the probability of any proposition, given a full joint distribution that specifies the probabilities of all atomic events. (See Section 13.4.) In subsequent sections we will derive additional rules for manipulating probabilities.
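As a small numerical check (my own construction, using the made-up table from the earlier sketch), the axioms and Equation (13.2) can be verified in a few lines of Python:

    # A minimal sketch: verify the axioms on a joint table (cavity, toothache) -> prob.
    P = {(True, True): 0.16, (True, False): 0.04,
         (False, True): 0.04, (False, False): 0.76}

    assert all(0 <= p <= 1 for p in P.values())   # axiom 1 for the atomic events
    assert abs(sum(P.values()) - 1.0) < 1e-9      # the distribution sums to 1

    def prob(pred):
        # Equation (13.2): sum the atomic events in which the proposition holds.
        return sum(p for event, p in P.items() if pred(event))

    cavity = lambda e: e[0]
    toothache = lambda e: e[1]
    # Axiom 3: P(a ∨ b) = P(a) + P(b) − P(a ∧ b)
    lhs = prob(lambda e: cavity(e) or toothache(e))
    rhs = prob(cavity) + prob(toothache) - prob(lambda e: cavity(e) and toothache(e))
    assert abs(lhs - rhs) < 1e-9
    # Negation rule derived above: P(¬a) = 1 − P(a)
    assert abs(prob(lambda e: not cavity(e)) - (1 - prob(cavity))) < 1e-9
    print("axioms hold for this table")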
First, however, we will examine the foundation for the axioms themselves.

Why the axioms of probability are reasonable

The axioms of probability can be seen as restricting the set of probabilistic beliefs that an agent can hold. This is somewhat analogous to the logical case, where a logical agent cannot simultaneously believe A, B, and ¬(A ∧ B), for example. There is, however, an additional complication. In the logical case, the semantic definition of conjunction means that at least one of the three beliefs just mentioned must be false in the world, so it is unreasonable for an agent to believe all three. With probabilities, on the other hand, statements refer not to the world directly, but to the agent's own state of knowledge. Why, then, can an agent not hold the following set of beliefs, which clearly violates axiom 3?

    P(a) = 0.4
    P(b) = 0.3          (13.3)
    P(a ∨ b) = 0.8

This kind of question has been the subject of decades of intense debate between those who advocate the use of probabilities as the only legitimate form for degrees of belief and those who advocate alternative approaches. Here, we give one argument for the axioms of probability, first stated in 1931 by Bruno de Finetti.

The key to de Finetti's argument is the connection between degree of belief and actions. The idea is that if an agent has some degree of belief in a proposition a, then the agent should be able to state odds at which it is indifferent to a bet for or against a. Think of it as a game between two agents: Agent 1 states "my degree of belief in event a is 0.4." Agent 2 is then free to choose whether to bet for or against a, at stakes that are consistent with the stated degree of belief. That is, Agent 2 could choose to bet that a will occur, betting 4 against Agent 1's 6. Or Agent 2 could bet 6 against 4 that a will not occur. (One might argue that the agent's preferences for different bank balances are such that the possibility of losing 1 is not counterbalanced by an equal possibility of winning 1. One possible response is to make the bet amounts small enough to avoid this problem. Savage's analysis (1954) circumvents the issue altogether.)

If an agent's degrees of belief do not accurately reflect the world, then you would expect that it would tend to lose money over the long run to an opposing agent whose beliefs more accurately reflect the state of the world. But de Finetti proved something much stronger: if Agent 1 expresses a set of degrees of belief that violate the axioms of probability theory, then there is a combination of bets by Agent 2 that guarantees that Agent 1 will lose money every time. So if you accept the idea that an agent should be willing to "put its money where its probabilities are," then you should accept that it is irrational to have beliefs that violate the axioms of probability.

One might think that this betting game is rather contrived. For example, what if one refuses to bet? Does that end the argument? The answer is that the betting game is an abstract model for the decision-making situation in which every agent is unavoidably involved at every moment. Every action (including inaction) is a kind of bet, and every outcome can be seen as a payoff of the bet. Refusing to bet is like refusing to allow time to pass.

We will not provide the proof of de Finetti's theorem, but we will show an example. Suppose that Agent 1 has the set of degrees of belief from Equation (13.3). Figure 13.2 shows that if Agent 2 chooses to bet 4 on a, 3 on b, and 2 on ¬(a ∨ b), then Agent 1 always loses money, regardless of the outcomes for a and b.

    Agent 1                 Agent 2                     Outcome for Agent 1
    Proposition  Belief     Bet        Stakes     a∧b    a∧¬b    ¬a∧b    ¬a∧¬b
    a            0.4        a          4 to 6      −6      −6       4        4
    b            0.3        b          3 to 7      −7       3      −7        3
    a ∨ b        0.8        ¬(a ∨ b)   2 to 8       2       2       2       −8
                                                  −11      −1      −1       −1

    Figure 13.2: Because Agent 1 has inconsistent beliefs, Agent 2 is able to devise a set of bets that guarantees a loss for Agent 1, no matter what the outcome of a and b.
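A tiny Python sketch that reproduces the payoffs of Figure 13.2 (the function name is mine):

    # Agent 1's total payoff under Agent 2's three bets, for each outcome of (a, b).
    def payoff_to_agent1(a, b):
        total = 0
        total += -6 if a else 4             # Agent 2 bets 4 on a at stakes 4 to 6
        total += -7 if b else 3             # Agent 2 bets 3 on b at stakes 3 to 7
        total += -8 if not (a or b) else 2  # Agent 2 bets 2 on ¬(a ∨ b) at 2 to 8
        return total

    for a in (True, False):
        for b in (True, False):
            print(a, b, payoff_to_agent1(a, b))   # -11, -1, -1, -1: always a loss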
Other strong philosophical arguments have been put forward for the use of probabilities, most notably those of Cox (1946) and Carnap (1950). The world being the way it is, however, practical demonstrations sometimes speak louder than proofs. The success of reasoning systems based on probability theory has been much more effective in making converts. We now look at how the axioms can be deployed to make inferences.

13.4 INFERENCE USING FULL JOINT DISTRIBUTIONS

In this section we will describe a simple method for probabilistic inference, that is, the computation from observed evidence of posterior probabilities for query propositions. We will use the full joint distribution as the "knowledge base" from which answers to all questions may be derived. Along the way we will also introduce several useful techniques for manipulating equations involving probabilities.

We begin with a very simple example: a domain consisting of just the three Boolean variables Toothache, Cavity, and Catch (the dentist's nasty steel probe catches in my tooth). The full joint distribution is a 2 × 2 × 2 table as shown in Figure 13.3.

                   toothache             ¬toothache
                   catch    ¬catch       catch    ¬catch
    cavity         0.108    0.012        0.072    0.008
    ¬cavity        0.016    0.064        0.144    0.576

    Figure 13.3: A full joint distribution for the Toothache, Cavity, Catch world.

Notice that the probabilities in the joint distribution sum to 1, as required by the axioms of probability. Notice also that Equation (13.2) gives us a direct way to calculate the probability of any proposition, simple or complex: we simply identify those atomic events in which the proposition is true and add up their probabilities. For example, there are six atomic events in which cavity ∨ toothache holds:

    P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28 .

One particularly common task is to extract the distribution over some subset of variables or a single variable. For example, adding the entries in the first row gives the unconditional or marginal probability of cavity:

    P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2 .

(The term "marginal" comes from a common practice among actuaries of writing the sums of observed frequencies in the margins of insurance tables.) This process is called marginalization, or summing out, because the variables other than Cavity are summed out. We can write the following general marginalization rule for any sets of variables Y and Z:

    P(Y) = Σz P(Y, z) .    (13.4)

That is, a distribution over Y can be obtained by summing out all the other variables from any joint distribution containing Y. A variant of this rule involves conditional probabilities instead of joint probabilities, using the product rule:

    P(Y) = Σz P(Y|z)P(z) .

This rule is called conditioning.
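A Python sketch (representation mine) of the Figure 13.3 joint as a dictionary, with Equation (13.2) used for propositions and Equation (13.4) for marginals:

    # A minimal sketch; event order in each key is (cavity, toothache, catch).
    P = {(True, True, True): 0.108,  (True, True, False): 0.012,
         (True, False, True): 0.072, (True, False, False): 0.008,
         (False, True, True): 0.016, (False, True, False): 0.064,
         (False, False, True): 0.144, (False, False, False): 0.576}

    def prob(pred):
        # Sum the probabilities of the atomic events in which the proposition holds.
        return sum(p for event, p in P.items() if pred(event))

    print(round(prob(lambda e: e[0] or e[1]), 3))   # P(cavity ∨ toothache) = 0.28
    print(round(prob(lambda e: e[0]), 3))           # marginal P(cavity) = 0.2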
Marginalization and conditioning will turn out to be useful rules for all kinds of derivations involving probability expressions.

In most cases, we will be interested in computing conditional probabilities of some variables, given evidence about others. Conditional probabilities can be found by first using Equation (13.1) to obtain an expression in terms of unconditional probabilities and then evaluating the expression from the full joint distribution. For example, we can compute the probability of a cavity, given evidence of a toothache, as follows:

    P(cavity|toothache) = P(cavity ∧ toothache) / P(toothache)
                        = (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064) = 0.6 .

Just to check, we can also compute the probability that there is no cavity, given a toothache:

    P(¬cavity|toothache) = P(¬cavity ∧ toothache) / P(toothache)
                         = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4 .

Notice that in these two calculations the term 1/P(toothache) remains constant, no matter which value of Cavity we calculate. In fact, it can be viewed as a normalization constant for the distribution P(Cavity|toothache), ensuring that it adds up to 1. Throughout the chapters dealing with probability, we will use α to denote such constants. With this notation, we can write the two preceding equations in one:

    P(Cavity|toothache) = α P(Cavity, toothache)
                        = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
                        = α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩] = α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩ .

Normalization will turn out to be a useful shortcut in many probability calculations.

From the example, we can extract a general inference procedure. We will stick to the case in which the query involves a single variable. We will need some notation: let X be the query variable (Cavity in the example), let E be the set of evidence variables (just Toothache in the example), let e be the observed values for them, and let Y be the remaining unobserved variables (just Catch in the example). The query is P(X|e) and can be evaluated as

    P(X|e) = α P(X, e) = α Σy P(X, e, y) ,

where the summation is over all possible ys (i.e., all possible combinations of values of the unobserved variables Y). Notice that together the variables X, E, and Y constitute the complete set of variables for the domain, so P(X, e, y) is simply a subset of probabilities from the full joint distribution. The algorithm is shown in Figure 13.4. It loops over the values of X and the values of Y to enumerate all possible atomic events with e fixed, adds up their probabilities from the joint table, and normalizes the results.

    function ENUMERATE-JOINT-ASK(X, e, P) returns a distribution over X
      inputs: X, the query variable
              e, observed values for variables E
              P, a joint distribution on variables {X} ∪ E ∪ Y   /* Y = hidden variables */

      Q(X) ← a distribution over X, initially empty
      for each value xi of X do
        Q(xi) ← ENUMERATE-JOINT(xi, e, Y, [], P)
      return NORMALIZE(Q(X))

    function ENUMERATE-JOINT(x, e, vars, values, P) returns a real number
      if EMPTY?(vars) then return P(x, e, values)
      Y ← FIRST(vars)
      return Σy ENUMERATE-JOINT(x, e, REST(vars), [y|values], P)

    Figure 13.4: An algorithm for probabilistic inference by enumeration of the entries in a full joint distribution.

Given the full joint distribution to work with, ENUMERATE-JOINT-ASK is a complete algorithm for answering probabilistic queries for discrete variables.
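Here is a minimal Python rendering of the same enumeration idea; the function and argument names are mine, and the joint is the Figure 13.3 table:

    # A minimal sketch of Figure 13.4's enumeration, specialized to the dictionary
    # representation above (event order: Cavity, Toothache, Catch).
    from itertools import product

    P = dict(zip(product([True, False], repeat=3),
                 [0.108, 0.012, 0.072, 0.008, 0.016, 0.064, 0.144, 0.576]))

    def enumerate_joint_ask(query_index, evidence, P):
        """Posterior over the query variable, given evidence as a dict mapping
        variable index -> observed value; hidden variables are summed out."""
        Q = {}
        for x in (True, False):
            Q[x] = sum(p for event, p in P.items()
                       if event[query_index] == x
                       and all(event[i] == v for i, v in evidence.items()))
        alpha = 1 / sum(Q.values())            # normalization constant α
        return {x: alpha * q for x, q in Q.items()}

    # P(Cavity | toothache): query index 0, evidence Toothache (index 1) = True.
    print(enumerate_joint_ask(0, {1: True}, P))   # ≈ {True: 0.6, False: 0.4}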
It does not scale well, however: for a domain described by n Boolean variables, it requires an input table of size O(2^n) and takes O(2^n) time to process the table. In a realistic problem, there might be hundreds or thousands of random variables to consider, not just three. It quickly becomes completely impractical to define the vast numbers of probabilities required; the experience needed in order to estimate each of the table entries separately simply cannot exist.

For these reasons, the full joint distribution in tabular form is not a practical tool for building reasoning systems (although the historical notes at the end of the chapter include one real-world application of this method). Instead, it should be viewed as the theoretical foundation on which more effective approaches may be built. The remainder of this chapter introduces some of the basic ideas required in preparation for the development of realistic systems in Chapter 14.

13.5 INDEPENDENCE

Let us expand the full joint distribution in Figure 13.3 by adding a fourth variable, Weather. The full joint distribution then becomes P(Toothache, Catch, Cavity, Weather), which has 32 entries (because Weather has four values). It contains four "editions" of the table shown in Figure 13.3, one for each kind of weather. It seems natural to ask what relationship these editions have to each other and to the original three-variable table. For example, how are P(toothache, catch, cavity, Weather = cloudy) and P(toothache, catch, cavity) related? One way to answer this question is to use the product rule:

    P(toothache, catch, cavity, Weather = cloudy)
        = P(Weather = cloudy | toothache, catch, cavity) P(toothache, catch, cavity) .

Now, unless one is in the deity business, one should not imagine that one's dental problems influence the weather. Therefore, the following assertion seems reasonable:

    P(Weather = cloudy | toothache, catch, cavity) = P(Weather = cloudy) .    (13.7)

From this, we can deduce

    P(toothache, catch, cavity, Weather = cloudy)
        = P(Weather = cloudy) P(toothache, catch, cavity) .

A similar equation exists for every entry in P(Toothache, Catch, Cavity, Weather). In fact, we can write the general equation

    P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather) .

Thus, the 32-element table for four variables can be constructed from one 8-element table and one 4-element table. This decomposition is illustrated schematically in Figure 13.5(a).

The property we used in writing Equation (13.7) is called independence (also marginal independence and absolute independence). In particular, the weather is independent of one's dental problems. Independence between propositions a and b can be written as

    P(a|b) = P(a)    or    P(b|a) = P(b)    or    P(a ∧ b) = P(a)P(b) .    (13.8)

All these forms are equivalent (Exercise 13.7). Independence between variables X and Y can be written as follows (again, these are all equivalent):

    P(X|Y) = P(X)    or    P(Y|X) = P(Y)    or    P(X, Y) = P(X)P(Y) .

    Figure 13.5: Two examples of factoring a large joint distribution into smaller distributions, using absolute independence. (a) Weather and dental problems are independent. (b) Coin flips are independent. (The original schematic diagrams, showing each large table decomposing into smaller ones, are not reproduced here.)
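A short Python sketch of the factoring in Figure 13.5(a), rebuilding the 32-entry joint from an 8-entry table and a 4-entry table via Equation (13.7) (table layouts mine, numbers from Figure 13.3 and the Weather prior above):

    # A minimal sketch: factoring a joint distribution using absolute independence.
    from itertools import product

    P_dental = dict(zip(product([True, False], repeat=3),   # (Cavity, Toothache, Catch)
                        [0.108, 0.012, 0.072, 0.008, 0.016, 0.064, 0.144, 0.576]))
    P_weather = {"sunny": 0.7, "rainy": 0.2, "cloudy": 0.08, "snow": 0.02}

    P_full = {d + (w,): P_dental[d] * P_weather[w]   # entrywise product, per (13.7)
              for d in P_dental for w in P_weather}
    print(len(P_full))                               # 32 entries from 8 + 4 numbers
    print(abs(sum(P_full.values()) - 1.0) < 1e-9)    # still a valid distribution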
Independence assertions are usually based on knowledge of the domain. As we have seen, they can dramatically reduce the amount of information necessary to specify the full joint distribution. If the complete set of variables can be divided into independent subsets, then the full joint can be factored into separate joint distributions on those subsets. For example, the joint distribution on the outcome of n independent coin flips, P(C1, . . . , Cn), can be represented as the product of n single-variable distributions P(Ci). In a more practical vein, the independence of dentistry and meteorology is a good thing, because otherwise the practice of dentistry might require intimate knowledge of meteorology and vice versa.

When they are available, then, independence assertions can help in reducing the size of the domain representation and the complexity of the inference problem. Unfortunately, clean separation of entire sets of variables by independence is quite rare. Whenever a connection, however indirect, exists between two variables, independence will fail to hold. Moreover, even independent subsets can be quite large; for example, dentistry might involve dozens of diseases and hundreds of symptoms, all of which are interrelated. To handle such problems, we will need more subtle methods than the straightforward concept of independence.

13.6 BAYES' RULE AND ITS USE

On page 470, we defined the product rule and pointed out that it can be written in two forms because of the commutativity of conjunction:

    P(a ∧ b) = P(a|b)P(b)    and    P(a ∧ b) = P(b|a)P(a) .

Equating the two right-hand sides and dividing by P(a), we get

    P(b|a) = P(a|b)P(b) / P(a) .

This equation is known as Bayes' rule (also Bayes' law or Bayes' theorem). (According to rule 1 on page 1 of Strunk and White's The Elements of Style, it should be Bayes's rather than Bayes'. The latter is, however, more commonly used.) This simple equation underlies all modern AI systems for probabilistic inference. The more general case of multivalued variables can be written in the P notation as

    P(Y|X) = P(X|Y)P(Y) / P(X) ,

where again this is to be taken as representing a set of equations, each dealing with specific values of the variables. We will also have occasion to use a more general version conditionalized on some background evidence e:

    P(Y|X, e) = P(X|Y, e)P(Y|e) / P(X|e) .

Applying Bayes' rule: The simple case

On the surface, Bayes' rule does not seem very useful. It requires three terms (a conditional probability and two unconditional probabilities) just to compute one conditional probability. Bayes' rule is useful in practice because there are many cases where we do have good probability estimates for these three numbers and need to compute the fourth. In a task such as medical diagnosis, we often have conditional probabilities on causal relationships and want to derive a diagnosis. A doctor knows that the disease meningitis causes the patient to have a stiff neck, say, 50% of the time. The doctor also knows some unconditional facts: the prior probability that a patient has meningitis is 1/50,000, and the prior probability that any patient has a stiff neck is 1/20. Letting s be the proposition that the patient has a stiff neck and m be the proposition that the patient has meningitis, we have

    P(s|m) = 0.5
    P(m) = 1/50000
    P(s) = 1/20
    P(m|s) = P(s|m)P(m) / P(s) = (0.5 × 1/50000) / (1/20) = 0.0002 .

That is, we expect only 1 in 5000 patients with a stiff neck to have meningitis. Notice that, even though a stiff neck is quite strongly indicated by meningitis (with probability 0.5), the probability of meningitis in the patient remains small. This is because the prior probability on stiff necks is much higher than that on meningitis.
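The arithmetic, as a short Python sketch (variable names mine):

    # A minimal sketch of the meningitis calculation via Bayes' rule.
    p_s_given_m = 0.5        # P(s|m): meningitis causes a stiff neck half the time
    p_m = 1 / 50000          # prior P(m)
    p_s = 1 / 20             # prior P(s)

    p_m_given_s = p_s_given_m * p_m / p_s
    print(p_m_given_s)       # 0.0002, i.e., 1 in 5000 stiff-neck patients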
Section 13.4 illustrated a process by which one can avoid assessing the probability of the evidence (here, P(s)) by instead computing a posterior probability for each value of the query variable (here, m and ¬m) and then normalizing the results. The same process can be applied when using Bayes' rule. We have

    P(M|s) = α ⟨P(s|m)P(m), P(s|¬m)P(¬m)⟩ .

Thus, in order to use this approach we need to estimate P(s|¬m) instead of P(s). There is no free lunch; sometimes this is easier, sometimes it is harder. The general form of Bayes' rule with normalization is

    P(Y|X) = α P(X|Y)P(Y) ,    (13.11)

where α is the normalization constant needed to make the entries in P(Y|X) sum to 1.

One obvious question to ask about Bayes' rule is why one might have available the conditional probability in one direction, but not the other. In the meningitis domain, perhaps the doctor knows that a stiff neck implies meningitis in 1 out of 5000 cases; that is, the doctor has quantitative information in the diagnostic direction from symptoms to causes. Such a doctor has no need to use Bayes' rule. Unfortunately, diagnostic knowledge is often more fragile than causal knowledge. If there is a sudden epidemic of meningitis, the unconditional probability of meningitis, P(m), will go up. The doctor who derived the diagnostic probability P(m|s) directly from statistical observation of patients before the epidemic will have no idea how to update the value, but the doctor who computes P(m|s) from the other three values will see that P(m|s) should go up proportionately with P(m). Most importantly, the causal information P(s|m) is unaffected by the epidemic, because it simply reflects the way meningitis works. The use of this kind of direct causal or model-based knowledge provides the crucial robustness needed to make probabilistic systems feasible in the real world.
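A Python sketch of the normalization route; here P(s|¬m) is back-computed from P(s) by conditioning, purely to show that the two routes agree (in practice it would be estimated directly):

    # A minimal sketch: P(M|s) = α⟨P(s|m)P(m), P(s|¬m)P(¬m)⟩.
    p_s_given_m, p_m, p_s = 0.5, 1/50000, 1/20
    p_not_m = 1 - p_m
    # From conditioning: P(s) = P(s|m)P(m) + P(s|¬m)P(¬m).
    p_s_given_not_m = (p_s - p_s_given_m * p_m) / p_not_m

    unnormalized = (p_s_given_m * p_m, p_s_given_not_m * p_not_m)
    alpha = 1 / sum(unnormalized)
    print(alpha * unnormalized[0])   # P(m|s) = 0.0002, matching the direct route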
Using Bayes' rule: Combining evidence

We have seen that Bayes' rule can be useful for answering probabilistic queries conditioned on one piece of evidence, for example, the stiff neck. In particular, we have argued that probabilistic information is often available in the form P(effect|cause). What happens when we have two or more pieces of evidence? For example, what can a dentist conclude if her nasty steel probe catches in the aching tooth of a patient? If we know the full joint distribution (Figure 13.3), one can read off the answer:

    P(Cavity | toothache ∧ catch) = α ⟨0.108, 0.016⟩ ≈ ⟨0.871, 0.129⟩ .

We know, however, that such an approach will not scale up to larger numbers of variables. We can try using Bayes' rule to reformulate the problem:

    P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity) .    (13.12)

For this reformulation to work, we need to know the conditional probabilities of the conjunction toothache ∧ catch for each value of Cavity. That might be feasible for just two evidence variables, but again it will not scale up. If there are n possible evidence variables (X rays, diet, oral hygiene, etc.), then there are 2^n possible combinations of observed values for which we would need to know conditional probabilities. We might as well go back to using the full joint distribution. This is what first led researchers away from probability theory toward approximate methods for evidence combination that, while giving incorrect answers, require fewer numbers to give any answer at all.

Rather than taking this route, we need to find some additional assertions about the domain that will enable us to simplify the expressions. The notion of independence in Section 13.5 provides a clue, but needs refining. It would be nice if Toothache and Catch were independent, but they are not: if the probe catches in the tooth, it probably has a cavity and that probably causes a toothache. These variables are independent, however, given the presence or the absence of a cavity. Each is directly caused by the cavity, but neither has a direct effect on the other: toothache depends on the state of the nerves in the tooth, whereas the probe's accuracy depends on the dentist's skill, to which the toothache is irrelevant. (We assume that the patient and dentist are distinct individuals.) Mathematically, this property is written as

    P(toothache ∧ catch | Cavity) = P(toothache | Cavity) P(catch | Cavity) .    (13.13)

This equation expresses the conditional independence of toothache and catch given Cavity. We can plug it into Equation (13.12) to obtain the probability of a cavity:

    P(Cavity | toothache ∧ catch) = α P(toothache | Cavity) P(catch | Cavity) P(Cavity) .

Now the information requirements are the same as for inference using each piece of evidence separately: the prior probability P(Cavity) for the query variable and the conditional probability of each effect, given its cause.
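A final Python sketch of evidence combination under Equation (13.13); the conditional probabilities below are derived from the Figure 13.3 joint, so the normalized result matches the exact answer above:

    # A minimal sketch of naive evidence combination via conditional independence.
    p_cavity = {True: 0.2, False: 0.8}
    p_toothache_given = {True: 0.6, False: 0.1}   # P(toothache | Cavity = c)
    p_catch_given = {True: 0.9, False: 0.2}       # P(catch | Cavity = c)

    unnormalized = {c: p_toothache_given[c] * p_catch_given[c] * p_cavity[c]
                    for c in (True, False)}
    alpha = 1 / sum(unnormalized.values())
    posterior = {c: alpha * q for c, q in unnormalized.items()}
    print(posterior)   # ≈ {True: 0.871, False: 0.129}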
