Reinforcement Learning Tutorial

In which we examine how an agent can learn from success and failure, from reward and punishment.

Chapters 18 and 20 covered learning methods that learn functions and probability models from examples. In this chapter, we will study how agents can learn what to do, particularly when there is no teacher telling the agent what action to take in each circumstance.

For example, we know an agent can learn to play chess by supervised learning: by being given examples of game situations along with the best moves for those situations. But if there is no friendly teacher providing examples, what can the agent do? By trying random moves, the agent can eventually build a predictive model of its environment: what the board will be like after it makes a given move, and even how the opponent is likely to reply in a given situation. The problem is this: without some feedback about what is good and what is bad, the agent will have no grounds for deciding which move to make. The agent needs to know that something good has happened when it wins and that something bad has happened when it loses. This kind of feedback is called a reward, or reinforcement. In games like chess, the reinforcement is received only at the end of the game. In other environments, the rewards come more frequently. In ping-pong, each point scored can be considered a reward; when learning to crawl, any forward motion is an achievement. Our framework for agents regards the reward as part of the input percept, but the agent must be "hardwired" to recognize that part as a reward rather than as just another sensory input. Thus, animals seem to be hardwired to recognize pain and hunger as negative rewards and pleasure and food intake as positive rewards. Reinforcement has been carefully studied by animal psychologists for over 60 years.

Rewards were introduced in Chapter 17, where they served to define optimal policies in Markov decision processes (MDPs). An optimal policy is a policy that maximizes the expected total reward. The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy for the environment. Whereas in Chapter 17 the agent has a complete model of the environment and knows the reward function, here we assume no prior knowledge of either. Imagine playing a new game whose rules you don't know; after a hundred or so moves, your opponent announces, "You lose." This is reinforcement learning in a nutshell.

In many complex domains, reinforcement learning is the only feasible way to train a program to perform at high levels. For example, in game playing, it is very hard for a human to provide accurate and consistent evaluations of large numbers of positions, which would be needed to train an evaluation function directly from examples. Instead, the program can be told when it has won or lost, and it can use this information to learn an evaluation function that gives reasonably accurate estimates of the probability of winning from any given position. Similarly, it is extremely difficult to program an agent to fly a helicopter; yet given appropriate negative rewards for crashing, wobbling, or deviating from a set course, an agent can learn to fly by itself.

Reinforcement learning might be considered to encompass all of AI: an agent is placed in an environment and must learn to behave successfully therein. To keep the chapter manageable, we will concentrate on simple settings and simple agent designs.
For the most part, we will assume a fully observable environment, so that the current state is supplied by each percept. On the other hand, we will assume that the agent does not know how the environment works or what its actions do, and we will allow for probabilistic action outcomes. We will consider three of the agent designs first introduced in Chapter 2:

- A utility-based agent learns a utility function on states and uses it to select actions that maximize the expected outcome utility.
- A Q-learning agent learns an action-value function, or Q-function, giving the expected utility of taking a given action in a given state.
- A reflex agent learns a policy that maps directly from states to actions.

A utility-based agent must also have a model of the environment in order to make decisions, because it must know the states to which its actions will lead. For example, in order to make use of a backgammon evaluation function, a backgammon program must know what its legal moves are and how they affect the board position. Only in this way can it apply the utility function to the outcome states. A Q-learning agent, on the other hand, can compare the values of its available choices without needing to know their outcomes, so it does not need a model of the environment. On the other hand, because they do not know where their actions lead, Q-learning agents cannot look ahead; this can seriously restrict their ability to learn, as we shall see.

We begin in Section 21.2 with passive learning, where the agent's policy is fixed and the task is to learn the utilities of states (or state-action pairs); this could also involve learning a model of the environment. Section 21.3 covers active learning, where the agent must also learn what to do. The principal issue is exploration: an agent must experience as much as possible of its environment in order to learn how to behave in it. Section 21.4 discusses how an agent can use inductive learning to learn much faster from its experiences. Section 21.5 covers methods for learning direct policy representations in reflex agents. An understanding of Markov decision processes (Chapter 17) is essential for this chapter.

21.2 Passive Reinforcement Learning

To keep things simple, we start with the case of a passive learning agent using a state-based representation in a fully observable environment. In passive learning, the agent's policy π is fixed: in state s, it always executes the action π(s). Its goal is simply to learn how good the policy is, that is, to learn the utility function U^π(s). We will use as our example the 4 x 3 world introduced in Chapter 17. Figure 21.1 shows a policy for that world and the corresponding utilities. Clearly, the passive learning task is similar to the policy evaluation task, part of the policy iteration algorithm described in Section 17.3. The main difference is that the passive learning agent does not know the transition model T(s, a, s'), which specifies the probability of reaching state s' from state s after doing action a; nor does it know the reward function R(s), which specifies the reward for each state.

Figure 21.1 (a) A policy π for the 4 x 3 world; this policy happens to be optimal with rewards of R(s) = -0.04 in the nonterminal states and no discounting. (b) The utilities of the states in the 4 x 3 world, given policy π.

The agent executes a set of trials in the environment using its policy π.
In each trial, the agent starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal states, (4,2) or (4,3). Its percepts supply both the current state and the reward received in that state. Typical trials might look like this:

    (1,1)-.04 → (1,2)-.04 → (1,3)-.04 → (1,2)-.04 → (1,3)-.04 → (2,3)-.04 → (3,3)-.04 → (4,3)+1
    (1,1)-.04 → (1,2)-.04 → (1,3)-.04 → (2,3)-.04 → (3,3)-.04 → (3,2)-.04 → (3,3)-.04 → (4,3)+1
    (1,1)-.04 → (2,1)-.04 → (3,1)-.04 → (3,2)-.04 → (4,2)-1

Note that each state percept is subscripted with the reward received in that state. The object is to use the information about rewards to learn the expected utility U^π(s) associated with each nonterminal state s. The utility is defined to be the expected sum of (discounted) rewards obtained if policy π is followed. As in Equation (17.3), this is written as

    U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s ] .          (21.1)

We will include a discount factor γ in all of our equations, but for the 4 x 3 world we will set γ = 1.

Direct utility estimation

A simple method for direct utility estimation was invented in the late 1950s in the area of adaptive control theory by Widrow and Hoff (1960). The idea is that the utility of a state is the expected total reward from that state onward, and each trial provides a sample of this value for each state visited. For example, the first trial in the set of three given earlier provides a sample total reward of 0.72 for state (1,1), two samples of 0.76 and 0.84 for (1,2), two samples of 0.80 and 0.88 for (1,3), and so on. Thus, at the end of each sequence, the algorithm calculates the observed reward-to-go for each state and updates the estimated utility for that state accordingly, just by keeping a running average for each state in a table. In the limit of infinitely many trials, the sample average will converge to the true expectation in Equation (21.1).

It is clear that direct utility estimation is just an instance of supervised learning where each example has the state as input and the observed reward-to-go as output. This means that we have reduced reinforcement learning to a standard inductive learning problem, as discussed in Chapter 18. Section 21.4 discusses the use of more powerful kinds of representations for the utility function, such as neural networks. Learning techniques for those representations can be applied directly to the observed data.

Direct utility estimation succeeds in reducing the reinforcement learning problem to an inductive learning problem, about which much is known. Unfortunately, it misses a very important source of information, namely, the fact that the utilities of states are not independent. The utility of each state equals its own reward plus the expected utility of its successor states. That is, the utility values obey the Bellman equations for a fixed policy (see also Equation (17.10)):

    U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U^π(s') .          (21.2)

By ignoring the connections between states, direct utility estimation misses opportunities for learning. For example, the second of the three trials given earlier reaches the state (3,2), which has not previously been visited. The next transition reaches (3,3), which is known from the first trial to have a high utility. The Bellman equation suggests immediately that (3,2) is also likely to have a high utility, because it leads to (3,3), but direct utility estimation learns nothing until the end of the trial.
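As a concrete illustration of the running-average scheme just described, here is a minimal Python sketch (not taken from the chapter; the trial format and the helper name direct_utility_estimate are illustrative assumptions). Each trial is a list of (state, reward) pairs, and a state's estimate is the average of the rewards-to-go observed over its visits.

```python
from collections import defaultdict

def direct_utility_estimate(trials, gamma=1.0):
    """Estimate U(s) as the average observed reward-to-go over all visits to s.

    trials: a list of trials, each a list of (state, reward) pairs in the
            order they were experienced.
    """
    totals = defaultdict(float)    # sum of observed rewards-to-go per state
    counts = defaultdict(int)      # number of recorded samples per state

    for trial in trials:
        # Scan backwards so that G_t = r_t + gamma * G_{t+1} is easy to accumulate.
        reward_to_go = 0.0
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in totals}

# The first trial from the text: seven -0.04 steps followed by the +1 terminal state.
trial1 = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
          ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), +1.0)]
print(direct_utility_estimate([trial1]))   # the (1,1) entry is 0.72, as in the text
```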
More broadly, we can view direct utility estimation as searching in a hypothesis space for U that is much larger than it needs to be, in that it includes many functions that violate the Bellman equations. For this reason, the algorithm often converges very slowly.

Adaptive dynamic programming

In order to take advantage of the constraints between states, an agent must learn how states are connected. An adaptive dynamic programming (or ADP) agent works by learning the transition model of the environment as it goes along and solving the corresponding Markov decision process using a dynamic programming method. For a passive learning agent, this means plugging the learned transition model T(s, π(s), s') and the observed rewards R(s) into the Bellman equations (21.2) to calculate the utilities of the states. As we remarked in our discussion of policy iteration in Chapter 17, these equations are linear (no maximization involved), so they can be solved using any linear algebra package. Alternatively, we can adopt the approach of modified policy iteration (see Chapter 17), using a simplified value iteration process to update the utility estimates after each change to the learned model. Because the model usually changes only slightly with each observation, the value iteration process can use the previous utility estimates as initial values and should converge quite quickly.

The process of learning the model itself is easy, because the environment is fully observable. This means that we have a supervised learning task where the input is a state-action pair and the output is the resulting state. In the simplest case, we can represent the transition model as a table of probabilities. We keep track of how often each action outcome occurs and estimate the transition probability T(s, a, s') from the frequency with which s' is reached when executing a in s. (This is the maximum-likelihood estimate, as discussed in Chapter 20; a Bayesian update with a Dirichlet prior might work better.) For example, in the three trials given earlier, Right is executed three times in (1,3) and two out of three times the resulting state is (2,3), so T((1,3), Right, (2,3)) is estimated to be 2/3.

The full agent program for a passive ADP agent is shown in Figure 21.2. Its performance on the 4 x 3 world is shown in Figure 21.3. In terms of how quickly its value estimates improve, the ADP agent does as well as possible, subject to its ability to learn the transition model. In this sense, it provides a standard against which to measure other reinforcement learning algorithms. It is, however, somewhat intractable for large state spaces. In backgammon, for example, it would involve solving roughly 10^50 equations in 10^50 unknowns.

    function PASSIVE-ADP-AGENT(percept) returns an action
        inputs: percept, a percept indicating the current state s' and reward signal r'
        static: π, a fixed policy
                mdp, an MDP with model T, rewards R, discount γ
                U, a table of utilities, initially empty
                N_sa, a table of frequencies for state-action pairs, initially zero
                N_sas', a table of frequencies for state-action-state triples, initially zero
                s, a, the previous state and action, initially null
        if s' is new then do U[s'] ← r'; R[s'] ← r'
        if s is not null then do
            increment N_sa[s, a] and N_sas'[s, a, s']
            for each t such that N_sas'[s, a, t] is nonzero do
                T[s, a, t] ← N_sas'[s, a, t] / N_sa[s, a]
        U ← POLICY-EVALUATION(π, U, mdp)
        if TERMINAL?[s'] then s, a ← null else s, a ← s', π[s']
        return a

Figure 21.2 A passive reinforcement learning agent based on adaptive dynamic programming. To simplify the code, we have assumed that each percept can be divided into a perceived state and a reward signal.

Figure 21.3 The passive ADP learning curves for the 4 x 3 world, given the optimal policy shown in Figure 21.1. (a) The utility estimates for a selected subset of states, as a function of the number of trials. Notice the large changes occurring around the 78th trial: this is the first time that the agent falls into the -1 terminal state at (4,2). (b) The root-mean-square error in the estimate for U(1,1), averaged over 20 runs of 100 trials each.

Temporal difference learning

It is possible to have (almost) the best of both worlds; that is, one can approximate the constraint equations shown earlier without solving them for all possible states. The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations. Consider, for example, the transition from (1,3) to (2,3) in the second trial above. Suppose that, as a result of the first trial, the utility estimates are U^π(1,3) = 0.84 and U^π(2,3) = 0.92. Now, if this transition occurred all the time, we would expect the utilities to obey

    U^π(1,3) = -0.04 + U^π(2,3) ,

so U^π(1,3) would be 0.88. Thus, its current estimate of 0.84 might be a little low and should be increased.
More generally, when a transition occurs from state s to state s', we apply the following update to U^π(s):

    U^π(s) ← U^π(s) + α ( R(s) + γ U^π(s') - U^π(s) ) .          (21.3)

Here, α is the learning rate parameter. Because this update rule uses the difference in utilities between successive states, it is often called the temporal-difference, or TD, equation.

The basic idea of all temporal-difference methods is first to define the conditions that hold locally when the utility estimates are correct, and then to write an update equation that moves the estimates toward this ideal "equilibrium" equation. In the case of passive learning, the equilibrium is given by Equation (21.2). Now Equation (21.3) does in fact cause the agent to reach the equilibrium given by Equation (21.2), but there is some subtlety involved. First, notice that the update involves only the observed successor s', whereas the actual equilibrium conditions involve all possible next states.
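For readers who prefer runnable code to the book-style pseudocode, here is a minimal Python sketch of the update in Equation (21.3). The dictionary-based tables, the particular learning-rate schedule, and the function name td_update are illustrative assumptions rather than the chapter's own code; note that the update looks only at the single observed successor s'.

```python
def td_update(U, N, s, reward, s_next, gamma=1.0):
    """One temporal-difference update, Equation (21.3):
       U[s] <- U[s] + alpha(N[s]) * (R(s) + gamma * U[s'] - U[s]).

    U: dict mapping states to utility estimates
    N: dict mapping states to visit counts
    s -> s_next: the observed transition; reward is R(s).
    """
    N[s] = N.get(s, 0) + 1
    alpha = 60.0 / (59.0 + N[s])          # the decaying rate reported for Figure 21.5
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] = U[s] + alpha * (reward + gamma * U[s_next] - U[s])
    return U[s]

# The example from the text: after the first trial, U(1,3) = 0.84 and U(2,3) = 0.92.
U = {(1, 3): 0.84, (2, 3): 0.92}
N = {(1, 3): 20}                          # hypothetical visit count for (1,3)
td_update(U, N, s=(1, 3), reward=-0.04, s_next=(2, 3))
print(U[(1, 3)])                          # nudged from 0.84 toward the target of 0.88
```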
One might think that basing the update on a single observed successor causes an improperly large change in U^π(s) when a very rare transition occurs; but, in fact, because rare transitions occur only rarely, the average value of U^π(s) will converge to the correct value. Furthermore, if we change α from a fixed parameter to a function that decreases as the number of times a state has been visited increases, then U(s) itself will converge to the correct value. (Technically, we require that Σ_{n=1}^∞ α(n) = ∞ and Σ_{n=1}^∞ α²(n) < ∞. The decay α(n) = 1/n satisfies these conditions. In Figure 21.5 we have used α(n) = 60/(59 + n).) This gives us the agent program shown in Figure 21.4.

    function PASSIVE-TD-AGENT(percept) returns an action
        inputs: percept, a percept indicating the current state s' and reward signal r'
        static: π, a fixed policy
                U, a table of utilities, initially empty
                N_s, a table of frequencies for states, initially zero
                s, a, r, the previous state, action, and reward, initially null
        if s' is new then U[s'] ← r'
        if s is not null then do
            increment N_s[s]
            U[s] ← U[s] + α(N_s[s]) (r + γ U[s'] - U[s])
        if TERMINAL?[s'] then s, a, r ← null else s, a, r ← s', π[s'], r'
        return a

Figure 21.4 A passive reinforcement learning agent that learns utility estimates using temporal differences.

Figure 21.5 illustrates the performance of the passive TD agent on the 4 x 3 world. It does not learn quite as fast as the ADP agent and shows much higher variability, but it is much simpler and requires much less computation per observation. Notice that TD does not need a model to perform its updates. The environment supplies the connection between neighboring states in the form of observed transitions.

Figure 21.5 The TD learning curves for the 4 x 3 world. (a) The utility estimates for a selected subset of states, as a function of the number of trials. (b) The root-mean-square error in the estimate for U(1,1), averaged over 20 runs of 500 trials each. Only the first 100 trials are shown to enable comparison with Figure 21.3.

The ADP approach and the TD approach are actually closely related. Both try to make local adjustments to the utility estimates in order to make each state "agree" with its successors. One difference is that TD adjusts a state to agree with its observed successor (Equation (21.3)), whereas ADP adjusts the state to agree with all of the successors that might occur, weighted by their probabilities (Equation (21.2)). This difference disappears when the effects of TD adjustments are averaged over a large number of transitions, because the frequency of each successor in the set of transitions is approximately proportional to its probability. A more important difference is that whereas TD makes a single adjustment per observed transition, ADP makes as many as it needs to restore consistency between the utility estimates U and the environment model T. Although the observed transition makes only a local change in T, its effects might need to be propagated throughout U. Thus, TD can be viewed as a crude but efficient first approximation to ADP.

Each adjustment made by ADP could be seen, from the TD point of view, as a result of a "pseudo-experience" generated by simulating the current environment model. It is possible to extend the TD approach to use an environment model to generate several pseudo-experiences: transitions that the TD agent can imagine might happen, given its current model. For each observed transition, the TD agent can generate a large number of imaginary transitions. In this way, the resulting utility estimates will approximate more and more closely those of ADP, of course at the expense of increased computation time.

In a similar vein, we can generate more efficient versions of ADP by directly approximating the algorithms for value iteration or policy iteration. Recall that full value iteration can be intractable when the number of states is large. Many of the adjustment steps, however, are extremely tiny.
One possible approach to generating reasonably good answers quickly is to bound the number of adjustments made after each observed transition. One can also use a heuristic to rank the possible adjustments so as to carry out only the most significant ones. The prioritized sweeping heuristic prefers to make adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates. Using heuristics like this, approximate ADP algorithms usually can learn roughly as fast as full ADP, in terms of the number of training sequences, but can be several orders of magnitude more efficient in terms of computation. (See Exercise 21.3.) This enables them to handle state spaces that are far too large for full ADP. Approximate ADP algorithms have an additional advantage: in the early stages of learning a new environment, the environment model T often will be far from correct, so there is little point in calculating an exact utility function to match it. An approximation algorithm can use a minimum adjustment size that decreases as the environment model becomes more accurate. This eliminates the very long value iterations that can occur early in learning due to large changes in the model.

21.3 Active Reinforcement Learning

A passive learning agent has a fixed policy that determines its behavior. An active agent must decide what actions to take. Let us begin with the adaptive dynamic programming agent and consider how it must be modified to handle this new freedom.

First, the agent will need to learn a complete model with outcome probabilities for all actions, rather than just the model for the fixed policy. The simple learning mechanism used by PASSIVE-ADP-AGENT will do just fine for this. Next, we need to take into account the fact that the agent has a choice of actions. The utilities it needs to learn are those defined by the optimal policy; they obey the Bellman equations given in Chapter 17, which we repeat here:

    U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s') .          (21.4)

These equations can be solved to obtain the utility function U using the value iteration or policy iteration algorithms from Chapter 17. The final issue is what to do at each step. Having obtained a utility function U that is optimal for the learned model, the agent can extract an optimal action by one-step look-ahead to maximize the expected utility; alternatively, if it uses policy iteration, the optimal policy is already available, so it should simply execute the action the optimal policy recommends. Or should it?

Exploration

Figure 21.6 shows the results of one sequence of trials for an ADP agent that follows the recommendation of the optimal policy for the learned model at each step. The agent does not learn the true utilities or the true optimal policy! What happens instead is that, in the 39th trial, it finds a policy that reaches the +1 reward along the lower route via (2,1), (3,1), (3,2), and (3,3). (See Figure 21.6.) After experimenting with minor variations, from the 276th trial onward it sticks to that policy, never learning the utilities of the other states and never finding the optimal route via (1,2), (1,3), and (2,3). We call this agent the greedy agent. Repeated experiments show that the greedy agent very seldom converges to the optimal policy for this environment and sometimes converges to really horrendous policies. How can it be that choosing the optimal action leads to suboptimal results?
The answer is that the learned model is not the same as the true environment; what is optimal in the learned model can therefore be suboptimal in the true environment. Unfortunately, the agent does not know what the true environment is, so it cannot compute the optimal action for the true environment. What, then, is to be done?

Figure 21.6 Performance of a greedy ADP agent that executes the action recommended by the optimal policy for the learned model. (a) RMS error in the utility estimates averaged over the nine nonterminal squares. (b) The suboptimal policy to which the greedy agent converges in this particular sequence of trials.

What the greedy agent has overlooked is that actions do more than provide rewards according to the current learned model; they also contribute to learning the true model by affecting the percepts that are received. By improving the model, the agent will receive greater rewards in the future. An agent therefore must make a trade-off between exploitation to maximize its reward (as reflected in its current utility estimates) and exploration to maximize its long-term well-being. (Notice the direct analogy to the theory of information value in Chapter 16.) Pure exploitation risks getting stuck in a rut. Pure exploration to improve one's knowledge is of no use if one never puts that knowledge into practice. In the real world, one constantly has to decide between continuing in a comfortable existence and striking out into the unknown in the hopes of discovering a new and better life. With greater understanding, less exploration is necessary.

Can we be a little more precise than this? Is there an optimal exploration policy? It turns out that this question has been studied in depth in the subfield of statistical decision theory that deals with so-called bandit problems. (See the sidebar "Exploration and bandits" below.) Although bandit problems are extremely difficult to solve exactly to obtain an optimal exploration method, it is nonetheless possible to come up with a reasonable scheme that will eventually lead to optimal behavior by the agent. Technically, any such scheme needs to be greedy in the limit of infinite exploration, or GLIE. A GLIE scheme must try each action in each state an unbounded number of times to avoid having a finite probability that an optimal action is missed because of an unusually bad series of outcomes.

Exploration and bandits

In Las Vegas, a one-armed bandit is a slot machine. A gambler can insert a coin, pull the lever, and collect the winnings (if any). An n-armed bandit has n levers. The gambler must choose which lever to play on each successive coin: the one that has paid off best, or maybe one that has not been tried?

The n-armed bandit problem is a formal model for real problems in many vitally important areas, such as deciding on the annual budget for AI research and development. Each arm corresponds to an action (such as allocating 20 million for the development of new AI textbooks), and the payoff from pulling the arm corresponds to the benefits obtained from taking the action (immense).
Exploration, whether it is exploration of a new research field or exploration of a new shopping mall, is risky, is expensive, and has uncertain payoffs; on the other hand, failure to explore at all means that one never discovers any actions that are worthwhile.

To formulate a bandit problem properly, one must define exactly what is meant by optimal behavior. Most definitions in the literature assume that the aim is to maximize the expected total reward obtained over the agent's lifetime. These definitions require that the expectation be taken over the possible worlds that the agent could be in, as well as over the possible results of each action sequence in any given world. Here, a "world" is defined by the transition model T(s, a, s'). Thus, in order to act optimally, the agent needs a prior distribution over the possible models. The resulting optimization problems are usually wildly intractable.

In some cases, for example when the payoff of each machine is independent and discounted rewards are used, it is possible to calculate a Gittins index for each slot machine (Gittins, 1989). The index is a function only of the number of times the slot machine has been played and how much it has paid off. The index for each machine indicates how worthwhile it is to invest more, based on a combination of expected return and expected value of information. Choosing the machine with the highest index value gives an optimal exploration policy. Unfortunately, no way has been found to extend Gittins indices to sequential decision problems.

One can use the theory of n-armed bandits to argue for the reasonableness of the selection strategy in genetic algorithms. (See Chapter 4.) If you consider each arm in an n-armed bandit problem to be a possible string of genes, and the investment of a coin in one arm to be the reproduction of those genes, then genetic algorithms allocate coins optimally, given an appropriate set of independence assumptions.

An ADP agent using a GLIE scheme will eventually learn the true environment model. A GLIE scheme must also eventually become greedy, so that the agent's actions become optimal with respect to the learned (and hence the true) model.

There are several GLIE schemes; one of the simplest is to have the agent choose a random action a fraction 1/t of the time and to follow the greedy policy otherwise. While this does eventually converge to an optimal policy, it can be extremely slow. A more sensible approach would give some weight to actions that the agent has not tried very often, while tending to avoid actions that are believed to be of low utility. This can be implemented by altering the constraint equation (21.4) so that it assigns a higher utility estimate to relatively unexplored state-action pairs. Essentially, this amounts to an optimistic prior over the possible environments and causes the agent to behave initially as if there were wonderful rewards scattered all over the place. Let us use U+(s) to denote the optimistic estimate of the utility (i.e., the expected reward-to-go) of the state s, and let N(a, s) be the number of times action a has been tried in state s. Suppose we are using value iteration in an ADP learning agent; then we need to rewrite the update equation (i.e., Equation (17.6)) to incorporate the optimistic estimate. The following equation does this:

    U+(s) ← R(s) + γ max_a f( Σ_{s'} T(s, a, s') U+(s'), N(a, s) ) .          (21.5)

Here, f(u, n) is called the exploration function.
It determines how greed (preference for high values of u) is traded off against curiosity (preference for low values of n, that is, actions that have not been tried often). The function f(u, n) should be increasing in u and decreasing in n. Obviously, there are many possible functions that fit these conditions. One particularly simple definition is

    f(u, n) = R+   if n < Ne
              u    otherwise

where R+ is an optimistic estimate of the best possible reward obtainable in any state and Ne is a fixed parameter. This will have the effect of making the agent try each action-state pair at least Ne times.

The fact that U+ rather than U appears on the right-hand side of Equation (21.5) is very important. As exploration proceeds, the states and actions near the start state might well be tried a large number of times. If we used U, the more pessimistic utility estimate, then the agent would soon become disinclined to explore further afield. The use of U+ means that the benefits of exploration are propagated back from the edges of unexplored regions, so that actions that lead toward unexplored regions are weighted more highly, rather than just actions that are themselves unfamiliar. The effect of this exploration policy can be seen clearly in Figure 21.7, which shows a rapid convergence toward optimal performance, unlike that of the greedy approach. A very nearly optimal policy is found after just 18 trials. Notice that the utility estimates themselves do not converge as quickly. This is because the agent stops exploring the unrewarding parts of the state space fairly soon, visiting them only "by accident" thereafter. However, it makes perfect sense for the agent not to care about the exact utilities of states that it knows are undesirable and can be avoided.

Figure 21.7 Performance of the exploratory ADP agent, using R+ = 2 and Ne = 5. (a) Utility estimates for selected states over time. (b) The RMS error in utility values and the associated policy loss.

Learning an Action-Value Function

Now that we have an active ADP agent, let us consider how to construct an active temporal-difference learning agent. The most obvious change from the passive case is that the agent is no longer equipped with a fixed policy, so, if it learns a utility function U, it will need to learn a model in order to be able to choose an action based on U via one-step look-ahead. The model acquisition problem for the TD agent is identical to that for the ADP agent. What of the TD update rule itself? Perhaps surprisingly, the update rule (21.3) remains unchanged. This might seem odd, for the following reason: Suppose the agent takes a step that normally leads to a good destination, but because of nondeterminism in the environment the agent ends up in a catastrophic state. The TD update rule will take this as seriously as if the outcome had been the normal result of the action, whereas one might suppose that, because the outcome was a fluke, the agent should not worry about it too much. In fact, of course, the unlikely outcome will occur only infrequently in a large set of training sequences; hence in the long run its effects will be weighted proportionally to its probability, as we would hope.
Once again, it can be shown that the TD algorithm will converge to the same values as ADP as the number of training sequences tends to infinity.

There is an alternative TD method called Q-learning that learns an action-value representation instead of learning utilities. We will use the notation Q(a, s) to denote the value of doing action a in state s. Q-values are directly related to utility values as follows:

    U(s) = max_a Q(a, s) .          (21.6)

Q-functions may seem like just another way of storing utility information, but they have a very important property: a TD agent that learns a Q-function does not need a model for either learning or action selection. For this reason, Q-learning is called a model-free method. As with utilities, we can write a constraint equation that must hold at equilibrium when the Q-values are correct:

    Q(a, s) = R(s) + γ Σ_{s'} T(s, a, s') max_{a'} Q(a', s') .          (21.7)

As in the ADP learning agent, we can use this equation directly as an update equation for an iteration process that calculates exact Q-values, given an estimated model. This does, however, require that a model also be learned, because the equation uses T(s, a, s'). The temporal-difference approach, on the other hand, requires no model. The update equation for TD Q-learning is

    Q(a, s) ← Q(a, s) + α ( R(s) + γ max_{a'} Q(a', s') - Q(a, s) ) ,          (21.8)

which is calculated whenever action a is executed in state s leading to state s'. The complete agent design for an exploratory Q-learning agent using TD is shown in Figure 21.8. Notice that it uses exactly the same exploration function f as that used by the exploratory ADP agent, hence the need to keep statistics on actions taken (the table N). If a simpler exploration policy is used, say acting randomly on some fraction of steps, where the fraction decreases over time, then we can dispense with the statistics.

    function Q-LEARNING-AGENT(percept) returns an action
        inputs: percept, a percept indicating the current state s' and reward signal r'
        static: Q, a table of action values indexed by state and action
                N_sa, a table of frequencies for state-action pairs
                s, a, r, the previous state, action, and reward, initially null
        if s is not null then do
            increment N_sa[s, a]
            Q[a, s] ← Q[a, s] + α(N_sa[s, a]) (r + γ max_a' Q[a', s'] - Q[a, s])
        if TERMINAL?[s'] then s, a, r ← null else s, a, r ← s', argmax_a' f(Q[a', s'], N_sa[a', s']), r'
        return a

Figure 21.8 An exploratory Q-learning agent. It is an active learner that learns the value Q(a, s) of each action in each situation. It uses the same exploration function f as the exploratory ADP agent, but avoids having to learn the transition model because the Q-value of a state can be related directly to those of its neighbors.

The Q-learning agent learns the optimal policy for the 4 x 3 world, but does so at a much slower rate than the ADP agent. This is because TD does not enforce consistency among values via the model. The comparison raises a general question: is it better to learn a model and a utility function or to learn an action-value function with no model? In other words, what is the best way to represent the agent function? This is an issue at the foundations of artificial intelligence. As we stated in Chapter 1, one of the key historical characteristics of much of AI research is its (often unstated) adherence to the knowledge-based approach. This amounts to an assumption that the best way to represent the agent function is to build a representation of some aspects of the environment in which the agent is situated.
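The following minimal Python sketch combines the TD Q-learning update of Equation (21.8) with the simple exploration function f described earlier, using R+ = 2 and Ne = 5 as in the chapter's experiments. The table layout, the 1/N learning-rate schedule, and the function names are illustrative assumptions rather than the chapter's code.

```python
R_PLUS = 2.0   # optimistic estimate of the best possible reward (R+ = 2 in the text)
N_E = 5        # try each state-action pair at least N_e times (N_e = 5 in the text)

def explore_f(u, n):
    """Exploration function f(u, n): optimistic value R+ until the pair has been
    tried N_e times, the learned estimate afterwards."""
    return R_PLUS if n < N_E else u

def q_update(Q, Nsa, s, a, reward, s_next, actions, gamma=1.0):
    """One TD Q-learning step, Equation (21.8):
       Q[a,s] <- Q[a,s] + alpha * (R(s) + gamma * max_a' Q[a',s'] - Q[a,s])."""
    Nsa[(s, a)] = Nsa.get((s, a), 0) + 1
    alpha = 1.0 / Nsa[(s, a)]                      # one simple decaying learning rate
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (reward + gamma * best_next - Q.get((s, a), 0.0))

def choose_action(Q, Nsa, s, actions):
    """Pick the action maximizing f(Q(a, s), N(a, s)), as in Figure 21.8."""
    return max(actions, key=lambda a: explore_f(Q.get((s, a), 0.0), Nsa.get((s, a), 0)))

actions = ["Up", "Down", "Left", "Right"]
Q, Nsa = {}, {}
q_update(Q, Nsa, s=(1, 3), a="Right", reward=-0.04, s_next=(2, 3), actions=actions)
print(choose_action(Q, Nsa, (2, 3), actions))      # untried pairs all look worth R+
```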
Some researchers, both inside and outside AI, have claimed that the availability of model-free methods such as Q-learning means that the knowledge-based approach is unnecessary. There is, however, little to go on but intuition. Our intuition, for what it's worth, is that as the environment becomes more complex, the advantages of a knowledge-based approach become more apparent. This is borne out even in games such as chess, checkers (draughts), and backgammon (see the next section), where efforts to learn an evaluation function by means of a model have met with more success than Q-learning methods.

21.4 Generalization in Reinforcement Learning

So far, we have assumed that the utility functions and Q-functions learned by the agents are represented in tabular form with one output value for each input tuple. Such an approach works reasonably well for small state spaces, but the time to convergence and (for ADP) the time per iteration increase rapidly as the space gets larger. With carefully controlled, approximate ADP methods, it might be possible to handle 10,000 states or more. This suffices for two-dimensional maze-like environments, but more realistic worlds are out of the question. Chess and backgammon are tiny subsets of the real world, yet their state spaces contain on the order of 10^50 or more states. It would be absurd to suppose that one must visit all these states in order to learn how to play the game!

One way to handle such problems is to use function approximation, which simply means using any sort of representation for the function other than a table. The representation is viewed as approximate because it might not be the case that the true utility function or Q-function can be represented in the chosen form. For example, in Chapter 6 we described an evaluation function for chess that is represented as a weighted linear function of a set of features (or basis functions) f1, ..., fn:

    U_θ(s) = θ1 f1(s) + θ2 f2(s) + ... + θn fn(s) .

A reinforcement learning algorithm can learn values for the parameters θ = θ1, ..., θn such that the evaluation function U_θ approximates the true utility function. Instead of, say, 10^50 values in a table, this function approximator is characterized by, say, n = 20 parameters: an enormous compression. Although no one knows the true utility function for chess, no one believes that it can be represented exactly in 20 numbers. If the approximation is good enough, however, the agent might still play excellent chess. (We do know that the exact utility function can be represented in a page or two of Lisp, Java, or C++; that is, it can be represented by a program that solves the game exactly every time it is called. We are interested only in function approximators that use a reasonable amount of computation. It might in fact be better to learn a very simple function approximator and combine it with a certain amount of look-ahead search. The trade-offs involved are currently not well understood.)

Function approximation makes it practical to represent utility functions for very large state spaces, but that is not its principal benefit. The compression achieved by a function approximator allows the learning agent to generalize from states it has visited to states it has not visited. That is, the most important aspect of function approximation is not that it
To give you some idea of the power of this effect: by examining only one in every of the possible backgammon states, it is possible to learn a utility function that allows a program to play as well as any human (Tesauro, 1992). On the flip side, of course, there is the problem that there could fail to be any function in the chosen hypothesis space that approximates the true utility function sufficiently well. As in all inductive learning, there is a trade-off between the size of the hypothesis space and the time it takes to learn the function. A larger hypothesis space increases the likelihood that a good approximation can be found, but also means that convergence is likely to be delayed. Let us begin with the simplest case, which is direct utility estimation. (See Section 21.2.) With function approximation, this is an instance of supervised learning. For example, sup- pose we represent the utilities for the 4 x 3 world using a simple linear function. The features of the squares are just their x and y coordinates, so we have (21.9) UO(X, y) = do + BIZ + dzy . Thus, if (QO, el, d2) = (0.5,0.2,0. I), then u0 (1,l) = 0.8. Given a collection of trials, we ob- tain a set of sample values of & (rc, y), and we can find the best fit, in the sense of minimizing the squared error, using standard linear regression. (See Chapter 20.) For reinforcement learning, it makes more sense to use an online learning algorithm that updates the parameters after each trial. Suppose we run a trial and the total reward obtained starting at (1,l) is 0.4. This suggests that (1, l), currently 0.8, is too large and must be reduced. How should the parameters be adjusted to achieve this? As with neural network learning, we write an error function and compute its gradient with respect to the parameters. If uj(s) is the observed total reward from state s onward in the jth trial, then the error is defined as (half) the squared difference of the predicted total and the actual total: Ei (s) = (uQ (s) - uj ())/2. The rate of change of the error with respect to each parameter di is aEj/aQi, so to move the parameter in the direction of decreasing the error, we want WIDROW-HOFFRULE This is called the Widrow-Hoff rule, or the delta rule, for online least-squares. For the DELTA RULE linear function approximator u(s) in Equation (21.9), we get three simple update rules: 00 + Qo + a (ui (s) - UO (s)) , 1 + Ql t- ((3) - UQ(S))X, 82 + Q2+01(uj(s) -u(s)). We can apply these rules to the example where u0(l,1) is 0.8 and uj(l, 1) is 0.4. Oo, dl, and o2 are all decreased by 0.4a, which reduces the error for (1,l). Notice that changing the Bis also changes the values of U for every other state This is what we mean by saying that function approximation allows a reinforcement learner to generalize from its experiences. We expect that the agent will learn faster if it uses a function approximator, provided that the hypothesis space is not too large, but includes some functions that are a reasonably good fit to the true utility function. Exercise 21.7 asks you to evaluate the performance of direct utility estimation, both with and without function approximation. The improvement Section 2 1.4. Generalization in Reinforcement Learning 779 in the 4 x 3 world is noticeable but not dramatic, because this is a very small state space to begin with. The improvement is much greater in a 10 x 10 world with a +1 reward at (10,lO). 
The improvement in the 4 x 3 world is noticeable but not dramatic, because this is a very small state space to begin with. The improvement is much greater in a 10 x 10 world with a +1 reward at (10,10). This world is well suited for a linear utility function because the true utility function is smooth and nearly linear. (See Exercise 21.10.) If we put the +1 reward at (5,5), the true utility is more like a pyramid and the function approximator in Equation (21.9) will fail miserably. All is not lost, however! Remember that what matters for linear function approximation is that the function be linear in the parameters; the features themselves can be arbitrary nonlinear functions of the state variables. Hence, we can include a term such as θ3 sqrt((x - x_g)^2 + (y - y_g)^2) that measures the distance to the goal.

We can apply these ideas equally well to temporal-difference learners. All we need do is adjust the parameters to try to reduce the temporal difference between successive states. The new versions of the TD and Q-learning equations (21.3 and 21.8) are

    θi ← θi + α ( R(s) + γ U_θ(s') - U_θ(s) ) ∂U_θ(s)/∂θi          (21.11)

for utilities and

    θi ← θi + α ( R(s) + γ max_{a'} Q_θ(a', s') - Q_θ(a, s) ) ∂Q_θ(a, s)/∂θi          (21.12)

for Q-values. These update rules can be shown to converge to the closest possible approximation to the true function when the function approximator is linear in the parameters. (The definition of distance between utility functions here is rather technical; see Tsitsiklis and Van Roy (1997).) Unfortunately, all bets are off when nonlinear functions, such as neural networks, are used. There are some very simple cases in which the parameters can go off to infinity even though there are good solutions in the hypothesis space. There are more sophisticated algorithms that can avoid these problems, but at present reinforcement learning with general function approximators remains a delicate art.

Function approximation can also be very helpful for learning a model of the environment. Remember that learning a model for an observable environment is a supervised learning problem, because the next percept gives the outcome state. Any of the supervised learning methods in Chapter 18 can be used, with suitable adjustments for the fact that we need to predict a complete state description rather than just a Boolean classification or a single real value. For example, if the state is defined by n Boolean variables, we will need to learn n Boolean functions to predict all the variables. For a partially observable environment, the learning problem is much more difficult. If we know what the hidden variables are and how they are causally related to each other and to the observable variables, then we can fix the structure of a dynamic Bayesian network and use the EM algorithm to learn the parameters, as was described in Chapter 20. Inventing the hidden variables and learning the model structure are still open problems.

We now turn to examples of large-scale applications of reinforcement learning. We will see that, in cases where a utility function (and hence a model) is used, the model is usually taken as given. For example, in learning an evaluation function for backgammon, it is normally assumed that the legal moves and their effects are known in advance.

Applications to game-playing

The first significant application of reinforcement learning was also the first significant learning program of any kind: the checker-playing program written by Arthur Samuel (1959, 1967). Samuel first used a weighted linear function for the evaluation of positions, using up to 16 terms at any one time. He applied a version of Equation (21.11) to update the weights. There were some significant differences, however, between his program and current methods.
First, he updated the weights using the difference between the current state and the backed-up value generated by full look-ahead in the search tree. This works fine, because it amounts to viewing the state space at a different granularity. A second difference was that the program did not use any observed rewards! That is, the values of terminal states were ignored. This means that it is quite possible for Samuel's program not to converge, or to converge on a strategy designed to lose rather than to win. He managed to avoid this fate by insisting that the weight for material advantage should always be positive. Remarkably, this was sufficient to direct the program into areas of weight space corresponding to good checker play.

Gerry Tesauro's TD-Gammon system (1992) forcefully illustrates the potential of reinforcement learning techniques. In earlier work (Tesauro and Sejnowski, 1989), Tesauro tried learning a neural network representation of Q(a, s) directly from examples of moves labeled with relative values by a human expert. This approach proved extremely tedious for the expert. It resulted in a program, called NEUROGAMMON, that was strong by computer standards, but not competitive with human experts. The TD-Gammon project was an attempt to learn from self-play alone. The only reward signal was given at the end of each game. The evaluation function was represented by a fully connected neural network with a single hidden layer containing 40 nodes. Simply by repeated application of Equation (21.11), TD-Gammon learned to play considerably better than NEUROGAMMON, even though the input representation contained just the raw board position with no computed features. This took about 200,000 training games and two weeks of computer time. Although that may seem like a lot of games, it is only a vanishingly small fraction of the state space. When precomputed features were added to the input representation, a network with 80 hidden units was able, after 300,000 training games, to reach a standard of play comparable to that of the top three human players worldwide. Kit Woolsey, a top player and analyst, said that "There is no question in my mind that its positional judgment is far better than mine."

Application to robot control

The setup for the famous cart-pole balancing problem, also known as the inverted pendulum, is shown in Figure 21.9. The problem is to control the position x of the cart so that the pole stays roughly upright (θ ≈ π/2), while staying within the limits of the cart track as shown. Over two thousand papers in reinforcement learning and control theory have been published on this seemingly simple problem. The cart-pole problem differs from the problems described earlier in that the state variables x, θ, and their time derivatives are continuous. The actions are usually discrete: jerk left or jerk right, the so-called bang-bang control regime.

Figure 21.9 Setup for the problem of balancing a long pole on top of a moving cart. The cart can be jerked left or right by a controller that observes x, θ, and their rates of change.

The earliest work on learning for this problem was carried out by Michie and Chambers (1968). Their BOXES algorithm was able to balance the pole for over an hour after only about 30 trials. Moreover, unlike many subsequent systems, BOXES was implemented with a real cart and pole, not a simulation. The algorithm first discretized the four-dimensional state space into boxes, hence the name.
It then ran trials until the pole fell over or the cart hit the end of the track. Negative reinforcement was associated with the final action in the final box and then propagated back through the sequence. It was found that the discretization caused some problems when the apparatus was initialized in a position different from those used in training, suggesting that generalization was not perfect. Improved generalization and faster learning can be obtained using an algorithm that adaptively partitions the state space according to the observed variation in the reward. Nowadays, balancing a triple inverted pendulum is a common exercise, a feat far beyond the capabilities of most humans.

21.5 Policy Search

The final approach we will consider for reinforcement learning problems is called policy search. In some ways, policy search is the simplest of all the methods in this chapter: the idea is to keep twiddling the policy as long as its performance improves, then stop.

Let us begin with the policies themselves. Remember that a policy π is a function that maps states to actions. We are interested primarily in parameterized representations of π that have far fewer parameters than there are states in the state space (just as in the preceding section). For example, we could represent π by a collection of parameterized Q-functions, one for each action, and take the action with the highest predicted value:

    π(s) = argmax_a Q_θ(a, s) .          (21.13)

Each Q-function could be a linear function of the parameters θ, as in Equation (21.9), or it could be a nonlinear function such as a neural network. Policy search will then adjust the parameters θ to improve the policy. Notice that if the policy is represented by Q-functions, then policy search results in a process that learns Q-functions. This process is not the same as Q-learning! In Q-learning with function approximation, the algorithm finds a value of θ such that Q_θ is "close" to Q*, the optimal Q-function. Policy search, on the other hand, finds a value of θ that results in good performance; the values found may differ very substantially. (Trivially, the approximate Q-function defined by Q_θ(a, s) = Q*(a, s)/10 gives optimal performance, even though it is not at all close to Q*.) Another clear example of the difference is the case where π(s) is calculated using, say, depth-10 look-ahead search with an approximate utility function U_θ. The value of θ that gives good play may be a long way from making U_θ resemble the true utility function.

One problem with policy representations of the kind given in Equation (21.13) is that the policy is a discontinuous function of the parameters when the actions are discrete. (For a continuous action space, the policy can be a smooth function of the parameters.) That is, there will be values of θ such that an infinitesimal change in θ causes the policy to switch from one action to another. This means that the value of the policy may also change discontinuously, which makes gradient-based search difficult. For this reason, policy search methods often use a stochastic policy representation π_θ(s, a), which specifies the probability of selecting action a in state s. One popular representation is the softmax function:

    π_θ(s, a) = e^{Q_θ(a, s)} / Σ_{a'} e^{Q_θ(a', s)} .

Softmax becomes nearly deterministic if one action is much better than the others, but it always gives a differentiable function of θ; hence, the value of the policy (which depends in a continuous fashion on the action selection probabilities) is a differentiable function of θ.
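A minimal Python sketch of the softmax policy representation may help make this concrete. The Q-value dictionary, the action names, and the max-subtraction trick for numerical stability are illustrative assumptions beyond the formula above.

```python
import math
import random

def softmax_policy(q_values):
    """Turn {action: Q_theta(a, s)} into probabilities
    pi_theta(s, a) = exp(Q_theta(a, s)) / sum_a' exp(Q_theta(a', s))."""
    m = max(q_values.values())                 # subtract the max for numerical stability
    exps = {a: math.exp(q - m) for a, q in q_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def sample_action(q_values):
    """Sample an action from the softmax distribution."""
    probs = softmax_policy(q_values)
    r, cumulative = random.random(), 0.0
    for a, p in probs.items():
        cumulative += p
        if r <= cumulative:
            return a
    return a                                   # guard against floating-point rounding

# With one clearly better action, the policy is nearly deterministic.
print(softmax_policy({"Up": 5.0, "Down": 0.1, "Left": 0.0, "Right": 0.2}))
```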
Now let us look at methods for improving the policy. We start with the simplest case: a deterministic policy and a deterministic environment. In this case, evaluating the policy is trivial: we simply execute it and observe the accumulated reward; this gives us the policy value ρ(θ). Improving the policy is just a standard optimization problem, as described in Chapter 4. We can follow the policy gradient vector ∇_θ ρ(θ), provided ρ(θ) is differentiable. Alternatively, we can follow the empirical gradient by hill climbing, that is, by evaluating the change in policy value for small increments in each parameter value. With the usual caveats, this process will converge to a local optimum in policy space.

When the environment (or the policy) is stochastic, things get more difficult. Suppose we are trying to do hill climbing, which requires comparing ρ(θ) and ρ(θ + Δθ) for some small Δθ. The problem is that the total reward on each trial may vary widely, so estimates of the policy value from a small number of trials will be quite unreliable; trying to compare two such estimates will be even more unreliable. One solution is simply to run lots of trials, measuring the sample variance and using it to determine that enough trials have been run to get a reliable indication of the direction of improvement for ρ(θ). Unfortunately, this is impractical for many real problems where each trial may be expensive, time-consuming, and perhaps even dangerous.

For the case of a stochastic policy π_θ(s, a), it is possible to obtain an unbiased estimate of the gradient at θ, ∇_θ ρ(θ), directly from the results of trials executed at θ. For simplicity, we will derive this estimate for the simple case of a nonsequential environment in which the reward is obtained immediately after each action.
