10.4 Reinforcement Learning

10.4.1 The General Philosophy

This section briefly introduces the basic ideas of a framework that has been highly popular in the artificial intelligence community in recent years. It was developed and used primarily by machine learning researchers [19, 930], and therefore this section is called reinforcement learning. The problem generally involves computing optimal plans for probabilistic infinite-horizon problems. The basic idea is to combine the problems of learning the probability distribution, P(θ|x, u), and computing the optimal plan into the same algorithm.

Terminology Before detailing the method further, some explanation of existing names seems required. Consider the term reinforcement learning. In machine learning, most decision-theoretic models are expressed in terms of reward instead of cost. Thus, the task is to make decisions or find plans that maximize a reward functional. Choosing good actions under this model appears to provide positive reinforcement in the form of a reward. Therefore, the term reinforcement is used. Using cost and minimization instead, some alternative names may be decision-theoretic learning or cost-based learning.

The term learning is associated with the problem because estimating the probability distribution P(θ|x, u) or P(x′|x, u) is clearly a learning problem. However, it is important to remember that there is also the planning problem of computing cost-to-go functions (or reward-to-go functions) and determining a plan that optimizes the costs (or rewards). Therefore, the term reinforcement planning may be just as reasonable.

The general framework is referred to as neuro-dynamic programming in [97] because the formulation and resulting algorithms are based on dynamic programming. Most often, a variant of value iteration is obtained. The neuro part refers to a family of functions that can be used to approximate plans and cost-to-go values. This term is fairly specific, however, because other function families may be used. Furthermore, for some problems (e.g., over small, finite state spaces), the cost values and plans are represented without approximation.

The name simulation-based methods is used in [95], which is perhaps one of the most accurate names (when used in the context of dynamic programming). Thus, simulation-based dynamic programming or simulation-based planning nicely reflects the framework explained here. The term simulation comes from the fact that a Monte Carlo simulator is used to generate samples for which the required distributions are learned during planning. You are, of course, welcome to use your favorite name, but keep in mind that under all of the names, the idea remains the same. This will be helpful to remember if you intend to study related literature.

The general framework The framework is usually applied to infinite-horizon problems under probabilistic uncertainty. The discounted-cost model is most popular; however, we will mostly work with Formulation 10.1 because it is closer to the main theme of this book. It has been assumed so far that when planning under Formulation 10.1, all model components are known, including P(xk+1|xk, uk). This can be considered as a traditional framework, in which there are three general phases:

Figure 10.11: The general framework for reinforcement learning (or simulation-based dynamic programming). The learning/planning/execution algorithm applies an action uk from xk to a Monte Carlo simulator, which returns the next state, xk+1, and the cost value, l(xk, uk).

Learning phase: The transition probabilities are estimated by visiting states in X, trying actions, and gathering statistics. When this phase concludes, the model of the environment is completely known.

Planning phase: An algorithm computes a feedback plan using a method such as value iteration or policy iteration.

Execution phase: The plan is executed on a machine that is connected to the same environment on which the learning phase was applied.

The simulation-based framework combines all three of these phases into one. Learning, planning, and execution are all conducted by a machine that initially knows nothing about the state transitions or even the cost terms. Ideally, the machine should be connected to a physical environment for which the Markov model holds. However, in nearly all implementations, the machine is instead connected to a Monte Carlo simulator as shown in Figure 10.11. Based on the current state, the algorithm sends an action, uk, to the simulator, and the simulator computes its effect by sampling according to its internal probability distributions. Obviously, the designer of the simulator knows the transition probabilities, but these are not given directly to the planning algorithm. The simulator then sends the next state, xk+1, and cost, l(xk, uk), back to the algorithm.
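To make the interaction of Figure 10.11 concrete, the following is a minimal sketch of what such a simulator might look like for a finite state space. The class and method names (MonteCarloSimulator, step) and the dictionary-based representation are assumptions of this sketch, not part of the original formulation; the key point is that the transition probabilities live only inside the simulator, and the algorithm only sees sampled outcomes.

```python
import random

class MonteCarloSimulator:
    """Holds P(x' | x, u) and l(x, u) internally; the planning algorithm
    never sees them and only observes sampled next states and costs."""

    def __init__(self, transition_probs, cost):
        # transition_probs[(x, u)] = list of (next_state, probability) pairs
        # cost[(x, u)] = stage cost l(x, u)
        self.transition_probs = transition_probs
        self.cost = cost

    def step(self, x, u):
        """Apply action u from state x; return (next state, cost)."""
        next_states, probs = zip(*self.transition_probs[(x, u)])
        x_next = random.choices(next_states, weights=probs, k=1)[0]
        return x_next, self.cost[(x, u)]
```

The sketches later in this section assume only this step interface, mirroring the arrows in Figure 10.11.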

For simplicity, l(xk, uk) is used instead of allowing the cost to depend on the particular nature action, which would yield l(xk, uk, θk). The explicit characterization of nature is usually not needed in this framework. The probabilities P(xk+1|xk, uk) are directly learned without specifying nature actions. It is common to generalize the cost term from l(xk, uk) to l(xk, uk, xk+1), but this is avoided here for notational convenience. The basic ideas remain the same, and only slight variations of the coming equations are needed to handle this generalization.

The simulator is intended to simulate “reality,” in which the machine interacts with the physical world. It replaces the environment in Figure 1.16b from Section 1.4. Using the interpretation of that section, the algorithms presented in this context can be considered as a plan as shown in Figure 1.18b. If the learning component is terminated, then the resulting feedback plan can be programmed into another machine, as shown in Figure 1.18a. This step is usually not performed, however, because often it is assumed that the machine continues to learn over its lifetime.

One of the main issues is exploration vs. exploitation [930]. Some repetitive exploration of the state space is needed to gather enough data to reliably estimate the model. For true theoretical convergence, each state-action pair must be tried infinitely often. On the other hand, information regarding the model should be exploited to accomplish tasks efficiently. These two goals are often in conflict. Focusing too much on exploration will not optimize costs. Focusing too much on exploitation may prevent useful solutions from being developed because better alternatives have not yet been discovered.

10.4.2 Evaluating a Plan via Simulation

The simulation method is based on averaging the information gained incrementally from samples. Suppose that you receive a sequence of costs, c1, c2, . . ., and would like to incrementally compute their average. You are not told the total number of samples in advance, and at any point you are required to report the current average. Let mi denote the average of the first i samples,

mi = (1/i) ∑_{j=1}^{i} cj.   (10.81)

To efficiently compute mi from mi−1, multiply mi−1 by i − 1 to recover the total, add ci, and then divide by i:

mi = ((i − 1)mi−1 + ci) / i.   (10.82)

This can be manipulated into

mi = mi−1 + (1/i)(ci − mi−1).   (10.83)
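As a small illustration of (10.83), the running average can be maintained with constant memory, without storing the samples or knowing their total number in advance. The following sketch is illustrative only; the function name is invented here.

```python
def incremental_average(costs):
    """Maintain the running average m_i = m_{i-1} + (1/i)(c_i - m_{i-1})."""
    m = 0.0
    for i, c in enumerate(costs, start=1):
        m += (c - m) / i          # equation (10.83)
        # m now equals the average of the first i samples
    return m

# Example: the result matches sum(costs) / len(costs).
print(incremental_average([3.0, 5.0, 1.0]))   # prints 3.0
```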

Now consider the problem of estimating the expected cost-to-go, Gπ(x), at every x ∈ X for some fixed plan, π. If P(x′|x, u) and the costs l(x, u) were known, then it could be computed by solving

Gπ(x) = l(x, u) + ∑_{x′∈X} P(x′|x, u) Gπ(x′).   (10.84)

However, without this information, we must rely on the simulator.

From each x ∈ X, suppose that 1000 trials are conducted, and the resulting costs to get to the goal are recorded and averaged. Each trial is an iterative process in which π selects the action, and the simulator indicates the next state and its incremental cost. Once the goal state is reached, the costs are totaled to yield the measured cost-to-go for that trial (this assumes that π(x) = uT for all x ∈ XG). If ci denotes this total cost at trial i, then the average, mi, over i trials provides an estimate of Gπ(x). As i tends to infinity, we expect mi to converge to Gπ(x). The update formula (10.83) can be conveniently used to maintain the improving sequence of cost-to-go estimates. Let Gˆπ(x) denote the current estimate of Gπ(x). The update formula based on (10.83) can be expressed as

Gˆπ(x) := Gˆπ(x) + (1/i)(l(x1, u1) + l(x2, u2) + · · · + l(xK, uK) − Gˆπ(x)),   (10.85)

in which := means assignment, in the sense used in some programming languages. It turns out that a single trial can actually yield update values for multiple states [930, 96]. Suppose that a trial is performed from x that results in the sequence x1 = x, x2, . . ., xk, . . ., xK, xF of visited states. For every state, xk, in the sequence, a cost-to-go value can be measured by recording the cost that was accumulated from xk to xK:

ck(xk) = ∑_{j=k}^{K} l(xj, uj).   (10.86)

It is much more efficient to make use of (10.85) on every state that is visited along the path.
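The following sketch evaluates a fixed plan π by simulation, applying the incremental update (10.85) with the accumulated cost (10.86) at every state visited during each trial. It assumes the step interface of the simulator sketch given earlier; the function name, the dictionaries, and the use of random start states are assumptions of this sketch rather than part of the original method.

```python
import random
from collections import defaultdict

def evaluate_plan(simulator, pi, start_states, goal_states, num_trials=1000):
    """Estimate G_pi(x) by simulation, updating every visited state."""
    G_hat = defaultdict(float)   # current estimates of the cost-to-go
    visits = defaultdict(int)    # number of samples received at each state

    for _ in range(num_trials):
        x = random.choice(start_states)
        path, costs = [], []
        # Run one trial under the plan pi until the goal region is reached.
        while x not in goal_states:
            u = pi[x]
            x_next, cost = simulator.step(x, u)
            path.append(x)
            costs.append(cost)
            x = x_next
        # Update every visited state with its accumulated cost (10.86),
        # using the incremental average of (10.85).
        for k, xk in enumerate(path):
            ck = sum(costs[k:])
            visits[xk] += 1
            G_hat[xk] += (ck - G_hat[xk]) / visits[xk]
    return G_hat
```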

Temporal differences Rather than waiting until the end of each trial to compute ck(xk), it is possible to update each state, xk, immediately after it is visited and l(xk, uk) is received from the simulator. This leads to a well-known method of estimating the cost-to-go called temporal differences [929]. It is very similar to the method already given but somewhat more complicated. It will be introduced here because the method frequently appears in reinforcement learning literature, and an extension of it leads to a nice simulation-based method for updating the estimated cost-to-go.

Once again, consider the sequence x1, . . ., xK, xF generated by a trial. Let dk denote a temporal difference, which is defined as

dk = l(xk, uk) + Gˆπ (xk+1) − Gˆπ (xk). (10.87)

Note that both l(xk, uk) + Gˆπ (xk+1) and Gˆπ (xk) could each serve as an estimate of Gπ (xk). The difference is that the right part of (10.87) utilizes the latest cost obtained from the simulator for the first step and then uses Gˆπ (xk+1) for an estimate of the remaining cost. In this and subsequent expressions, every action, uk, is chosen using the plan: uk = π(xk).

Let vk denote the number of times that xk has been visited so far, for each 1 ≤ k ≤ K, including previous trials and the current visit. The following update algorithm can be used during the trial. When x2 is reached, the value at x1 is updated as

Gˆπ(x1) := Gˆπ(x1) + (1/v1) d1.   (10.88)

When x3 is reached, the values at x1 and x2 are updated as

Gˆπ(x1) := Gˆπ(x1) + (1/v1) d2,
Gˆπ(x2) := Gˆπ(x2) + (1/v2) d2.   (10.89)

Now consider what has been done so far at x1. The temporal differences partly collapse:

Gˆπ(x1) := Gˆπ(x1) + (1/v1) d1 + (1/v1) d2
         = Gˆπ(x1) + (1/v1)(l(x1, u1) + Gˆπ(x2) − Gˆπ(x1) + l(x2, u2) + Gˆπ(x3) − Gˆπ(x2))
         = Gˆπ(x1) + (1/v1)(l(x1, u1) + l(x2, u2) − Gˆπ(x1) + Gˆπ(x3)).   (10.90)

When x4 is reached, similar updates are performed. At xk, the updates are

Gˆπ(x1) := Gˆπ(x1) + (1/v1) dk,
Gˆπ(x2) := Gˆπ(x2) + (1/v2) dk,
⋮
Gˆπ(xk) := Gˆπ(xk) + (1/vk) dk.   (10.91)

The updates are performed in this way until xF ∈ XG is reached. Now consider what was actually computed for each xk. The temporal differences form a telescoping sum that collapses, as shown in (10.90) after two iterations. After all iterations have been completed, the value at xk has been updated as

Gˆπ(xk) := Gˆπ(xk) + (1/vk) dk + (1/vk+1) dk+1 + · · · + (1/vK) dK + (1/vF) dF
         = Gˆπ(xk) + (1/vk)(l(xk, uk) + l(xk+1, uk+1) + · · · + l(xK, uK) − Gˆπ(xk) + Gˆπ(xF))
         = Gˆπ(xk) + (1/vk)(l(xk, uk) + l(xk+1, uk+1) + · · · + l(xK, uK) − Gˆπ(xk)).   (10.92)

The final Gˆπ(xF) was deleted because its value is zero, assuming that the termination action is applied by π. The resulting final expression is equivalent to (10.85) if each visited state in the sequence was distinct. This is often not true, which makes the method discussed above differ slightly from the method of (10.85) because the count, vk, may change during the trial in the temporal difference scheme. This difference, however, is negligible, and the temporal difference method computes estimates that converge to Gπ [96, 97].

The temporal difference method presented so far can be generalized in a way that often leads to faster convergence in practice. Let λ ∈ [0, 1] be a specified parameter. The TD(λ) temporal difference method replaces the equations in (10.91) with

Gˆπ(x1) := Gˆπ(x1) + λ^{k−1} (1/v1) dk,
Gˆπ(x2) := Gˆπ(x2) + λ^{k−2} (1/v2) dk,
⋮
Gˆπ(xk−1) := Gˆπ(xk−1) + λ (1/vk−1) dk,
Gˆπ(xk) := Gˆπ(xk) + (1/vk) dk.   (10.93)

This has the effect of discounting costs that are received far away from xk. The method in (10.91) was the special case of λ = 1, yielding TD(1).
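A single trial of the TD(λ) scheme in (10.93) might be sketched as follows: each time a new temporal difference dk becomes available, every state visited so far in the trial is updated, with older states receiving geometrically discounted corrections. The function name is invented here, the simulator follows the earlier step sketch, and G_hat and visits are assumed to be defaultdicts (float and int, respectively) shared across trials.

```python
def td_lambda_trial(simulator, pi, x_start, goal_states, G_hat, visits, lam=1.0):
    """Run one trial from x_start and apply the TD(lambda) updates (10.93).
    With lam = 1 this reduces to the scheme in (10.88)-(10.91)."""
    path = []
    x = x_start
    while x not in goal_states:
        u = pi[x]                                  # actions come from the plan
        x_next, cost = simulator.step(x, u)
        path.append(x)
        visits[x] += 1                             # v_k includes the current visit
        d = cost + G_hat[x_next] - G_hat[x]        # temporal difference (10.87)
        # Update every state visited so far in this trial, most recent first:
        # x_k gets d/v_k, x_{k-1} gets lam*d/v_{k-1}, ..., x_1 gets lam^{k-1}*d/v_1.
        for age, xj in enumerate(reversed(path)):
            G_hat[xj] += (lam ** age) * d / visits[xj]
        x = x_next
    return G_hat

# Usage (assumed): G_hat = defaultdict(float); visits = defaultdict(int);
# call td_lambda_trial repeatedly from the available start states.
```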

Another interesting special case is TD(0), which becomes

Gˆπ(xk) = Gˆπ(xk) + (1/vk)(l(xk, uk) + Gˆπ(xk+1) − Gˆπ(xk)).   (10.94)

This form appears most often in reinforcement learning literature (although it is applied to the discounted-cost model instead). Experimental evidence indicates that lower values of λ help to improve the convergence rate. Convergence for all values of λ is proved in [97].

One source of intuition about why (10.94) works is that it is a special case of a stochastic iterative algorithm or the Robbins-Monro algorithm [88, 97, 566]. This is a general statistical estimation technique that is used for solving systems of the form h(y) = y by using a sequence of samples. Each sample represents a measurement of h(y) using Monte Carlo simulation. The general form of this iterative approach is to update y as

y := (1 − ρ)y + ρh(y), (10.95)

in which ρ ∈ [0, 1] is a parameter whose choice affects the convergence rate. Intuitively, (10.95) updates y by interpolating between its original value and the most recent sample of h(y). Convergence proofs for this algorithm are not given here; see [97] for details. The typical behavior is that a smaller value of ρ leads to more reliable estimates when there is substantial noise in the simulation process, but this comes at the cost of slowing the convergence rate. The convergence is asymptotic, which requires that all edges (that have nonzero probability) in the plan-based state transition graph should be visited infinitely often.

A general approach to obtaining Gˆπ can be derived within the stochastic iterative framework by generalizing TD(0):

Gˆπ(x) := (1 − ρ)Gˆπ(x) + ρ(l(x, u) + Gˆπ(x′)).   (10.96)

The formulation of TD(0) in (10.94) essentially selects the ρ parameter by the way it was derived, but in (10.96) any ρ ∈ (0, 1) may be used.

It may appear incorrect that the update equation does not take into account the transition probabilities. It turns out that they are taken into account in the simulation process because transitions that are more likely to occur have a stronger effect on (10.96). The same thing occurs when the mean of a nonuniform probability density function is estimated by using samples from the distribution. The values that occur with higher frequency make stronger contributions to the average, which automatically gives them the appropriate weight.
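The stochastic iterative update (10.96) is easy to apply step by step along a simulated trial, as in the following sketch. The function name is invented, rho is the interpolation parameter of (10.95), and G_hat is assumed to be a defaultdict(float) shared across trials; the transition probabilities never appear because they are reflected in the frequency of the sampled next states.

```python
def td0_trial(simulator, pi, x_start, goal_states, G_hat, rho=0.1):
    """Run one trial and apply the TD(0)-style update (10.96) at each step."""
    x = x_start
    while x not in goal_states:
        u = pi[x]
        x_next, cost = simulator.step(x, u)
        # Interpolate between the old estimate and the newest sample of
        # l(x, u) + G_pi(x'), as in (10.95) and (10.96).
        G_hat[x] = (1.0 - rho) * G_hat[x] + rho * (cost + G_hat[x_next])
        x = x_next
    return G_hat
```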

10.4.3 Q-Learning: Computing an Optimal Plan

This section moves from evaluating a plan to computing an optimal plan in the simulation-based framework. The most important idea is the computation of Q-factors, Q∗(x, u). This is an extension of the optimal cost-to-go, G∗, that records optimal costs for each possible combination of a state, x ∈ X, and action u ∈ U(x). The interpretation of Q∗(x, u) is the expected cost received by starting from state x, applying u, and then following the optimal plan from the resulting next state, x′ = f(x, u, θ). If u happens to be the same action as would be selected by the optimal plan, π∗(x), then Q∗(x, u) = G∗(x). Thus, the Q-value can be thought of as the cost of making an arbitrary choice in the first stage and then exhibiting optimal decision making afterward.

Value iteration A simulation-based version of value iteration can be constructed from Q-factors. The reason for their use instead of G∗ is that a minimization over U(x) will be avoided in the dynamic programming. Avoiding this minimization enables a sample-by-sample approach to estimating the optimal values and ultimately obtaining the optimal plan. The optimal cost-to-go can be obtained from the Q-factors as

G∗(x) = min_{u∈U(x)} {Q∗(x, u)}.   (10.97)

This enables the dynamic programming recurrence in (10.46) to be expressed as

Q∗(x, u) = l(x, u) + ∑_{x′∈X} P(x′|x, u) min_{u′∈U(x′)} {Q∗(x′, u′)}.   (10.98)

By applying (10.97) to the right side of (10.98), it can also be expressed using G∗ as

Q∗(x, u) = l(x, u) + ∑_{x′∈X} P(x′|x, u) G∗(x′).   (10.99)

If P(x′|x, u) and l(x, u) were known, then (10.98) would lead to an alternative, storage-intensive way to perform value iteration. After convergence occurs, (10.97) can be used to obtain the G∗ values. The optimal plan is constructed as

π∗(x) = argmin_{u∈U(x)} {Q∗(x, u)}.   (10.100)

Since the costs and transition probabilities are unknown, a simulation-based approach is needed. The stochastic iterative algorithm idea can be applied once again. Recall that (10.96) estimated the cost of a plan by using individual samples and required a convergence-rate parameter, ρ. Using the same idea here, a simulation-based version of value iteration can be derived as

Qˆ∗(x, u) := (1 − ρ)Qˆ∗(x, u) + ρ(l(x, u) + min_{u′∈U(x′)} {Qˆ∗(x′, u′)}),   (10.101)

in which x′ is the next state and l(x, u) is the cost obtained from the simulator when u is applied at x. Initially, all Q-factors are set to zero. Sample trajectories that arrive at the goal can be generated using simulation, and (10.101) is applied to the resulting states and costs in each stage. Once again, the update equation may appear to be incorrect because the transition probabilities are not explicitly mentioned, but this is taken into account automatically through the simulation.

In most literature, Q-learning is applied to the discounted cost model. This yields a minor variant of (10.101):

Qˆ∗(x, u) := (1 − ρ)Qˆ∗(x, u) + ρ(l(x, u) + α min_{u′∈U(x′)} {Qˆ∗(x′, u′)}),   (10.102)

in which the discount factor α appears because the update equation is derived from (10.76).
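A sketch of the resulting Q-learning procedure is given below: Q-factors start at zero, sample trajectories are generated with the simulator, and (10.101) is applied at each step, with α < 1 giving the discounted variant (10.102). The function name, the actions dictionary, and the purely random exploration rule are assumptions of this sketch; the text does not prescribe a particular exploration strategy.

```python
import random
from collections import defaultdict

def q_learning(simulator, actions, start_states, goal_states,
               num_trials=1000, rho=0.1, alpha=1.0):
    """Estimate Q*(x, u) from simulated trials via (10.101)/(10.102)."""
    Q = defaultdict(float)   # all Q-factors initialized to zero

    def best_q(x):
        # min over u' in U(x') of the current Q-factor estimates; zero at the goal
        return 0.0 if x in goal_states else min(Q[(x, u)] for u in actions[x])

    for _ in range(num_trials):
        x = random.choice(start_states)
        while x not in goal_states:
            u = random.choice(actions[x])        # exploration rule (an assumption)
            x_next, cost = simulator.step(x, u)
            Q[(x, u)] = (1.0 - rho) * Q[(x, u)] + rho * (cost + alpha * best_q(x_next))
            x = x_next

    # Extract a plan from the Q-factors, as in (10.100).
    plan = {x: min(actions[x], key=lambda u: Q[(x, u)]) for x in actions
            if x not in goal_states}
    return Q, plan
```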

Policy iteration A simulation-based policy iteration algorithm can be derived using Q-factors. Recall from Section 10.2.2 that methods are needed to: 1) evaluate a given plan, π, and 2) improve the plan by selecting better actions. The plan evaluation previously involved linear equation solving. Now any plan, π, can be evaluated without even knowing P(x′|x, u) by using the methods of Section 10.4.2. Once Gˆπ is computed reliably from every x ∈ X, further simulation can be used to compute Qπ(x, u) for each x ∈ X and u ∈ U(x). This can be achieved by defining a version of (10.99) that is constrained to π:

Qπ(x, u) = l(x, u) + ∑_{x′∈X} P(x′|x, u) Gπ(x′).   (10.103)

The transition probabilities do not need to be known. The Q-factors are computed by simulation and averaging. The plan can be improved by setting

π′(x) = argmin_{u∈U(x)} {Qπ(x, u)},   (10.104)

which is based on (10.97).
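Putting the pieces together, one iteration of simulation-based policy iteration alternates plan evaluation, Q-factor estimation as in (10.103), and improvement via (10.104). The sketch below reuses the td0_trial sketch given earlier for the evaluation step; the function name, trial counts, and stopping test are assumptions of this sketch.

```python
import random
from collections import defaultdict

def policy_iteration(simulator, actions, start_states, goal_states, pi,
                     num_eval_trials=500, num_q_samples=200, rho=0.1):
    """Alternate simulation-based evaluation and improvement of the plan pi.
    pi maps every non-goal state to an action; actions[x] lists U(x)."""
    while True:
        # 1) Evaluate the current plan by simulation (TD(0) sketch above).
        G_hat = defaultdict(float)
        for _ in range(num_eval_trials):
            td0_trial(simulator, pi, random.choice(start_states), goal_states,
                      G_hat, rho)

        # 2) Estimate Q_pi(x, u) by further simulation, as in (10.103).
        Q_pi = defaultdict(float)
        counts = defaultdict(int)
        for x in pi:
            for u in actions[x]:
                for _ in range(num_q_samples):
                    x_next, cost = simulator.step(x, u)
                    counts[(x, u)] += 1
                    sample = cost + G_hat[x_next]
                    # incremental averaging, as in (10.83)
                    Q_pi[(x, u)] += (sample - Q_pi[(x, u)]) / counts[(x, u)]

        # 3) Improve the plan, as in (10.104).
        new_pi = {x: min(actions[x], key=lambda u: Q_pi[(x, u)]) for x in pi}
        if new_pi == pi:
            return pi, G_hat
        pi = new_pi
```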

Figure 10.12: A 3 × 3 matrix game expressed using a game tree.
