10.1 Introducing Sequential Games Against Nature

This section extends many ideas from Chapter 2 to the case in which nature interferes with the outcome of actions. Section 10.1.1 defines the planning problem in this context, which is a direct extension of Section 2.1. Due to unpredictability, forward projections and backprojections are introduced in Section 10.1.2 to characterize possible future and past states, respectively. Forward projections characterize the future states that will be obtained under the application of a plan or a sequence of actions. In Chapter 2 this concept was not needed because the sequence of future states could always be derived from a plan and initial state. Section 10.1.3 defines the notion of a plan and uses forward projections to indicate how its execution may differ every time the plan is applied.

10.1.1 Model Definition

The formulation presented in this section is an extension of Formulation 2.3 that incorporates the effects of nature at every stage. Let X denote a discrete state space, and let U(x) denote the set of actions available to the decision maker (or robot) from state x ∈ X. At each stage k it is assumed that a nature action θk is chosen from a set Θ(xk, uk). This can be considered as a multi-stage generalization of Formulation 9.4, which introduced Θ(u). Now Θ may depend on the state in addition to the action because both xk and uk are available in the current setting. This implies that nature acts with knowledge of the action selected by the decision maker. It is always assumed that during stage k, the decision maker does not know the particular nature action that will be chosen. It does, however, know the set Θ(xk, uk) for all xk ∈ X and uk ∈ U(xk).

As in Section 9.2, there are two alternative nature models: nondeterministic or probabilistic. If the nondeterministic model is used, then it is only known that nature will make a choice from Θ(xk, uk). In this case, making decisions using worst-case analysis is appropriate.

If the probabilistic model is used, then a probability distribution over Θ(xk, uk) is specified as part of the model. The most important assumption to keep in mind for this case is that nature is Markovian. In general, this means that the probability depends only on local information. In most applications, this locality is with respect to time. In our formulation, it means that the distribution over Θ(xk, uk) depends only on information obtained at the current stage. In other settings, Markovian could mean a dependency on a small number of stages, or even a local dependency in terms of spatial relationships, as in a Markov random field [231, 377].

To make the Markov assumption more precise, the state and action histories as defined in Section 8.2.1 will be used again here. Let

x˜k = (x1, x2, . . . , xk) (10.1)

and

u˜k = (u1, u2, . . . , uk). (10.2)

These represent all information that is available up to stage k. Without the Markov assumption, it could be possible that the probability distribution for nature is conditioned on all of x˜k and u˜k, to obtain P(θk|x˜k, u˜k). The Markov assumption declares that for all θk ∈ Θ(xk, uk),

P(θk|x˜k, u˜k) = P(θk|xk, uk), (10.3)

which drops all history except the current state and action. Once these two are known, there is no extra information regarding the nature action that could be gained from any portion of the histories.

The effect of nature is defined in the state transition equation, which produces a new state, xk+1, once xk, uk, and θk are given:

xk+1 = f(xk, uk, θk). (10.4)

From the perspective of the decision maker, θk is not given. Therefore, it can only infer that a particular set of states will result from applying uk from xk:

Xk+1(xk, uk) = {xk+1 ∈ X | ∃θk ∈ Θ(xk, uk) such that xk+1 = f(xk, uk, θk)}. (10.5)

In (10.5), the notation Xk+1(xk, uk) indicates a set of possible values for xk+1, given xk and uk. The notation Xk(·) will generally be used to indicate the possible values for xk that can be derived using the information that appears in the argument.
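As a quick illustration, the successor set in (10.5) can be computed directly from Θ and f. The sketch below uses a toy number-line model (the function names, and the particular Θ and f, are assumptions made only for illustration):

```python
def successors(x, u, Theta, f):
    """The set X_{k+1}(x_k, u_k) of (10.5): one candidate next state per
    nature action.  Theta(x, u) yields nature's choices; f is the state
    transition function.  All names here are illustrative."""
    return {f(x, u, theta) for theta in Theta(x, u)}

# Toy number-line model (assumed): nature perturbs the commanded step
# by -1, 0, or +1.
Theta = lambda x, u: (-1, 0, 1)
f = lambda x, u, theta: x + u + theta

print(sorted(successors(100, -2, Theta, f)))   # [97, 98, 99]
```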

In the probabilistic case, a probability distribution over X can be derived for stage k + 1, under the application of uk from xk. As part of the problem, P(θk|xk, uk) is given. Using the state transition equation, xk+1 = f(xk, uk, θk),

P(xk+1|xk, uk) = Σ_{θk ∈ Θ′} P(θk|xk, uk) (10.6)

can be derived, in which

Θ′ = {θk ∈ Θ(xk, uk) | xk+1 = f(xk, uk, θk)}. (10.7)

The calculation of P(xk+1|xk, uk) simply involves accumulating all of the proba- bility mass that could lead to xk+1 from the application of various nature actions.
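This accumulation can be sketched in a few lines. The helper below assumes that nature's distribution is supplied as a function P_theta(θ, x, u), which is a hypothetical interface; the text only requires that P(θk|xk, uk) be specified in some form:

```python
from collections import defaultdict

def next_state_distribution(x, u, Theta, P_theta, f):
    """P(x_{k+1} | x_k, u_k) per (10.6)-(10.7): for each nature action,
    add its probability mass to the next state it produces."""
    dist = defaultdict(float)
    for theta in Theta(x, u):
        dist[f(x, u, theta)] += P_theta(theta, x, u)
    return dict(dist)

# Toy number-line model (assumed): three equally likely nature actions.
Theta = lambda x, u: (-1, 0, 1)
P_theta = lambda theta, x, u: 1.0 / 3.0
f = lambda x, u, theta: x + u + theta

print(next_state_distribution(0, 2, Theta, P_theta, f))
# states 1, 2, and 3 each receive probability 1/3
```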

Putting these parts of the model together and adding some of the components from Formulation 2.3 leads to the following formulation:

Formulation 10.1 (Discrete Planning with Nature)

1. A nonempty state space X which is a finite or countably infinite set of states.

2. For each state x ∈ X, a finite, nonempty action space U(x). It is assumed that U contains a special termination action, which has the same effect as the one defined in Formulation 2.3.

3. A finite, nonempty nature action space Θ(x, u) for each x ∈ X and u ∈ U(x).

4. A state transition function f that produces a state, f(x, u, θ), for every x ∈ X, u ∈ U, and θ ∈ Θ(x, u).

5. A set of stages, each denoted by k, that begins at k = 1 and continues indefinitely. Alternatively, there may be a fixed, maximum stage k = K + 1 = F.

6. An initial state xI ∈ X. For some problems, this may not be specified, in which case a solution plan must be found from all initial states.

7. A goal set XG ⊂ X.

8. A stage-additive cost functional L. Let θ˜K denote the history of nature actions up to stage K. The cost functional may be applied to any combination of state, action, and nature histories to yield

L(x˜F , u˜K , θ˜K ) = Σ_{k=1}^{K} l(xk, uk, θk) + lF (xF ), (10.8)

in which F = K + 1. If the termination action uT is applied at some stage k, then for all i ≥ k, ui = uT , xi = xk, and l(xi, uT , θi) = 0.
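For readers who like to see the pieces assembled, Formulation 10.1 maps naturally onto a small container type. The field names below are our own invention; only the mathematical objects come from the formulation:

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Optional, Set

@dataclass
class PlanningProblem:
    """A direct transcription of Formulation 10.1 (field names are ours)."""
    U: Callable[[Any], Iterable[Any]]            # item 2: U(x), including u_T
    Theta: Callable[[Any, Any], Iterable[Any]]   # item 3: Theta(x, u)
    f: Callable[[Any, Any, Any], Any]            # item 4: f(x, u, theta)
    l: Callable[[Any, Any, Any], float]          # item 8: stage cost
    lF: Callable[[Any], float]                   # item 8: final cost
    XG: Set[Any]                                 # item 7: goal set
    xI: Optional[Any] = None                     # item 6: initial state

# A number-line instance with the feasible-planning costs of (10.9),
# using "uT" as a stand-in token for the termination action:
problem = PlanningProblem(
    U=lambda x: (-2, 2, "uT"),
    Theta=lambda x, u: (-1, 0, 1),
    f=lambda x, u, th: x if u == "uT" else x + u + th,
    l=lambda x, u, th: 0.0,
    lF=lambda x: 0.0 if x in {-1, 0, 1} else float("inf"),
    XG={-1, 0, 1},
    xI=100,
)
```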

Using Formulation 10.1, either a feasible or an optimal planning problem can be defined. To obtain a feasible planning problem, let l(xk, uk, θk) = 0 for all xk ∈ X, uk ∈ U, and θk ∈ Θ(xk, uk). Furthermore, let

lF (xF ) = 0 if xF ∈ XG, and lF (xF ) = ∞ otherwise. (10.9)

To obtain an optimal planning problem, in general l(xk, uk, θk) may assume any nonnegative, finite value if xk ∉ XG. For problems that involve probabilistic uncertainty, it is sometimes appropriate to assign a high, finite value for lF (xF ) if xF ∉ XG, as opposed to assigning an infinite cost for failing to achieve the goal.

Note that in each stage, the cost term is generally allowed to depend on the nature action θk. If probabilistic uncertainty is used, then Formulation 10.1 is often referred to as a controlled Markov process or Markov decision process (MDP). If the actions are removed from the formulation, then it is simply referred to as a Markov process. In most statistical literature, the name Markov chain is used instead of Markov process when there are discrete stages (as opposed to continuous-time Markov processes). Thus, the terms controlled Markov chain and Markov decision chain may be preferable.

In some applications, it may be convenient to avoid the explicit characterization of nature. Suppose that l(xk, uk, θk) = l(xk, uk). If nondeterministic uncertainty is used, then Xk+1(xk, uk) can be specified for all xk ∈ X and uk ∈ U(xk) as a substitute for the state transition equation; this avoids having to refer to nature. The application of an action uk from a state xk directly leads to a specified subset of X. If probabilistic uncertainty is used, then P(xk+1|xk, uk) can be directly defined as the alternative to the state transition equation. This yields a probability distribution over X, if uk is applied from some xk, once again avoiding explicit reference to nature. Most of the time we will use a state transition equation that refers to nature; however, it is important to keep these alternatives in mind. They arise in many related books and research articles.

As used throughout Chapter 2, a directed state transition graph is sometimes convenient for expressing the planning problem. The same idea can be applied in the current setting. As in Section 2.1, X is the vertex set; however, the edge definition must change to reflect nature. A directed edge exists from state x to x′ if there exists some u ∈ U(x) and θ ∈ Θ(x, u) such that x′ = f(x, u, θ). A weighted graph can be made by associating the cost term l(xk, uk, θk) with each edge. In the case of a probabilistic model, the probability of the transition occurring may also be associated with each edge.

Note that both the decision maker and nature are needed to determine which vertex will be reached. As the decision maker contemplates applying an action u from the state x, it sees that there may be several outgoing edges due to nature. If a different action is contemplated, then this set of possible outgoing edges changes. Once nature applies its action, then the particular edge is traversed to arrive at the new state; however, this is not completely controlled by the decision maker.

Example 10.1 (Traversing the Number Line) Let X = Z, U = {−2, 2, uT }, and Θ = {−1, 0, 1}. The action sets of the decision maker and nature are the same for all states. For the state transition equation, xk+1 = f(xk, uk, θk) = xk + uk + θk. For each stage, unit cost is received. Hence l(x, u, θ) = 1 for all x, θ, and u ≠ uT . The initial state is xI = 100, and the goal set is XG = {−1, 0, 1}.

Consider executing a sequence of actions, (−2, −2, . . . , −2), under the nondeterministic uncertainty model. This means that we attempt to move left two units in each stage. After the first −2 is applied, the set of possible next states is {97, 98, 99}. Nature may slow down the progress to be only one unit per stage, or

it may speed up the progress so that XG is three units closer per stage. Note that after 100 stages, the goal is guaranteed to be achieved, in spite of any possible actions of nature. Once XG is reached, uT should be applied. If the problem is changed so that XG = {0}, it becomes impossible to guarantee that the goal will be reached because nature may cause the goal to be overshot.

Now let U = {−1, 1, uT } and Θ = {−2, −1, 0, 1, 2}. Under nondeterministic uncertainty, the problem can no longer be solved because nature is now powerful

Figure 10.1: A grid-based shortest path problem with interference from nature.

enough to move the state completely in the wrong direction in the worst case. A reasonable probabilistic version of the problem can, however, be defined and solved.

Suppose that P(θ) = 1/5 for each θ ∈ Θ. The transition probabilities can be defined from P(θ). For example, if xk = 100 and uk = −1, then P(xk+1|xk, uk) = 1/5 if 97 ≤ xk+1 ≤ 101, and P(xk+1|xk, uk) = 0 otherwise. With the probabilistic formulation, there is a nonzero probability that the goal, XG = {−1, 0, 1}, will be reached, even though in the worst case reaching the goal is not guaranteed.
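The transition probabilities claimed in this example can be checked mechanically by counting nature actions, since each of the five is equally likely (the helper name is ours):

```python
from collections import Counter

# Second model of Example 10.1: U = {-1, 1, uT}, Theta = {-2, -1, 0, 1, 2},
# with each nature action having probability 1/5.
Theta = (-2, -1, 0, 1, 2)

def transition_probs(x, u):
    """P(x_{k+1} | x_k, u_k), obtained by counting equally likely
    nature actions."""
    counts = Counter(x + u + theta for theta in Theta)
    return {x_next: c / len(Theta) for x_next, c in counts.items()}

print(transition_probs(100, -1))
# {97: 0.2, 98: 0.2, 99: 0.2, 100: 0.2, 101: 0.2}
```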

Example 10.2 (Moving on a Grid) A grid-based robot planning model can be made. A simple example is shown in Figure 10.1. The state space is a subset of a 15 × 15 integer grid in the plane. A state is represented as (i, j), in which 1 ≤ i, j ≤ 15; however, the points in the center region (shown in Figure 10.1) are not included in X.

Let A = {0, 1, 2, 3, 4} be a set of actions, which denote “stay,” “right,” “up,” “left,” and “down,” respectively. Let U = A ∪ {uT }. For each x ∈ X, let U(x) contain uT and whichever actions are applicable from x (some are not applicable along the boundaries).

Let Θ(x, u) represent the set of all actions in A that are applicable after performing the move implied by u. For example, if x = (2, 2) and u = 3, then the robot is attempting to move to (1, 2). From this state, there are three neighboring states, each of which corresponds to an action of nature. Thus, Θ(x, u) in this case is {0, 1, 2, 4}. The action θ = 3 does not appear because there is no state to the left of (1, 2). Suppose that the probabilistic model is used, and that every nature action is equally likely.

The state transition function f is formed by adding the effect of both uk and θk. For example, if xk = (i, j), uk = 1, and θk = 2, then xk+1 = (i + 1, j + 1). If θk had been 3, then the two actions would cancel and xk+1 = (i, j). Without nature, it would have been assumed that θk = 0. As always, the state never changes once uT is applied, regardless of nature’s actions.

For the cost functional, let l(xk, uk) = 1 (unless uk = uT ; in this case, l(xk, uT ) = 0). For the final stage, let lF (xF ) = 0 if xF ∈ XG; otherwise, let lF (xF ) = ∞. A reasonable task is to get the robot to terminate in XG in the minimum expected number of stages. A feedback plan is needed, which will be introduced in Section 10.1.3, and the optimal plan for this problem can be efficiently computed using the methods of Section 10.2.1.

This example can be easily generalized to moving through a complicated labyrinth in two or more dimensions. If the grid resolution is high, then an approximation to motion planning is obtained. Rather than forcing motions in only four directions, it may be preferable to allow any direction. This case is covered in Section 10.6, which addresses planning in continuous state spaces.

10.1.2 Forward Projections and Backprojections

A forward projection is a useful concept for characterizing the behavior of plans during execution. Before uncertainties were considered, a plan was executed exactly as expected. When a sequence of actions was applied to an initial state, the resulting sequence of states could be computed using the state transition equation. Now that the state transitions are unpredictable, we would like to imagine what states are possible several stages into the future. In the case of nondeterministic uncertainty, this involves computing a set of possible future states, given a current state and plan. In the probabilistic case, a probability distribution over states is computed instead.

Nondeterministic forward projections To simplify the notation, suppose in this section that U(x) = U for all x ∈ X. This assumption will be lifted in Section 10.1.3.

Suppose that the initial state, x1 = xI , is known. If the action u1 ∈ U is applied, then the set of possible next states is

X2(x1, u1) = {x2 ∈ X | ∃θ1 ∈ Θ(x1, u1) such that x2 = f(x1, u1, θ1)}, (10.10)

which is just a special version of (10.5). Now suppose that an action u2 ∈ U will be applied. The forward projection must determine which states could be reached from x1 by applying u1 followed by u2. This can be expressed as

X3(x1, u1, u2) = {x3 ∈ X | ∃θ1 ∈ Θ(x1, u1) and ∃θ2 ∈ Θ(x2, u2)

such that x2 = f(x1, u1, θ1) and x3 = f(x2, u2, θ2)}.

(10.11)

This idea can be repeated for any number of iterations but becomes quite cumbersome in the current notation. It is helpful to formulate the forward projection recursively. Suppose that an action history u˜k is fixed. Let Xk+1(Xk, uk) denote the forward projection at stage k + 1, given that Xk is the forward projection at stage k. This can be computed as

Xk+1(Xk, uk) = {xk+1 ∈ X | ∃xk ∈ Xk and ∃θk ∈ Θ(xk, uk)

such that xk+1 = f(xk, uk, θk)}.

(10.12)

This may be applied any number of times to compute Xk+1 from an initial condi- tion X1 = {x1}.

Example 10.3 (Nondeterministic Forward Projections) Recall the first model given in Example 10.1, in which U = {−2, 2, uT } and Θ = {−1, 0, 1}. Suppose that x1 = 0, and u = 2 is applied. The one-stage forward projection is X2(0, 2) = {1, 2, 3}. If u = 2 is applied again, the two-stage forward projection is X3(0, 2, 2) = {2, 3, 4, 5, 6}. Repeating this process, the k-stage forward projection is {k, . . . , 3k}.
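The recursion (10.12) is short enough to state directly as code. The sketch below reproduces Example 10.3 under our transcription of the number-line model:

```python
def forward_project(X_k, u, Theta, f):
    """One application of the recursion (10.12): all states reachable
    from some state in X_k under u and some nature action."""
    return {f(x, u, theta) for x in X_k for theta in Theta(x, u)}

# The number-line model of Example 10.1 (our transcription):
Theta = lambda x, u: (-1, 0, 1)
f = lambda x, u, theta: x + u + theta

X = {0}                               # X_1 = {x_1}
for _ in range(2):                    # apply u = 2 twice
    X = forward_project(X, 2, Theta, f)
print(sorted(X))                      # [2, 3, 4, 5, 6]
```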

Probabilistic forward projections The probabilistic forward projection can be considered as a Markov process because the “decision” part is removed once the actions are given. Suppose that xk is given and uk is applied. What is the probability distribution over xk+1? This was already specified in (10.6) and is the one-stage forward projection. Now consider the two-stage probabilistic forward projection, P(xk+2|xk, uk, uk+1). This can be computed by marginalization as

P(xk+2|xk, uk, uk+1) = Σ_{xk+1 ∈ X} P(xk+2|xk+1, uk+1) P(xk+1|xk, uk). (10.13)

Computing further forward projections requires nested summations, which marginalize all of the intermediate states. For example, the three-stage forward projection is

P(xk+3|xk, uk, uk+1, uk+2) = Σ_{xk+1 ∈ X} Σ_{xk+2 ∈ X} P(xk+3|xk+2, uk+2) P(xk+2|xk+1, uk+1) P(xk+1|xk, uk). (10.14)

A convenient expression of the probabilistic forward projections can be obtained by borrowing nice algebraic properties from linear algebra. For each action u ∈ U, let its state transition matrix Mu be an n × n matrix, for n = |X|, of probabilities. The matrix is defined as

     [ m1,1  m1,2  · · ·  m1,n ]
     [ m2,1  m2,2  · · ·  m2,n ]
Mu = [  ...    ...         ... ] ,  (10.15)
     [ mn,1  mn,2  · · ·  mn,n ]

in which

mi,j = P(xk+1 = i | xk = j, u). (10.16)

For each j, the jth column of Mu must sum to one and can be interpreted as the probability distribution over X that is obtained if uk is applied from state xk = j. Let v denote an n-dimensional column vector that represents any probability distribution over X. The product Mu v yields a column vector that represents the probability distribution over X that is obtained after starting with v and applying u. The matrix multiplication performs n inner products, each of which is a marginalization as shown in (10.13). The forward projection at any stage, k, can now be expressed using a product of k − 1 state transition matrices. Suppose that u˜k−1 is fixed. Let v = [0 · · · 0 1 0 · · · 0], which indicates that x1 is known (with probability one). The forward projection can be computed as

v′ = Muk−1 Muk−2 · · · Mu2 Mu1 v. (10.17)

The ith element of v′ is P(xk = i | x1, u˜k−1).

Example 10.4 (Probabilistic Forward Projections) Once again, use the first model from Example 10.1; however, now assign probability 1/3 to each nature action. Assume that, initially, x1 = 0, and u = 2 is applied in every stage. The one-stage forward projection yields probabilities

[1/3 1/3 1/3] (10.18)

over the sequence of states (1, 2, 3). The two-stage forward projection yields

[1/9 2/9 3/9 2/9 1/9] (10.19)

over (2, 3, 4, 5, 6). .
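These numbers can be reproduced with the matrix formulation of (10.15)–(10.17). The sketch below truncates the number line to states 0 through 9 so that a finite matrix can be built (an assumption made only for illustration; no probability mass reaches the boundary within two stages), and assumes numpy is available:

```python
import numpy as np

# Truncated number line, states 0..9.  Under u = 2, nature adds -1, 0,
# or +1, each with probability 1/3, as in Example 10.4.
n = 10
M = np.zeros((n, n))
for j in range(n):                    # column j: distribution after u from j
    for theta in (-1, 0, 1):
        i = j + 2 + theta
        if i < n:
            M[i, j] += 1.0 / 3.0

v = np.zeros(n)
v[0] = 1.0                            # x1 = 0 with probability one
v2 = M @ (M @ v)                      # two applications of (10.17)
print(np.round(v2[2:7], 4))           # [0.1111 0.2222 0.3333 0.2222 0.1111]
```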

Backprojections Sometimes it is helpful to define the set of possible previous states from which one or more current states could be obtained. For example, they will become useful in defining graph-based planning algorithms in Section 10.2.3. This involves maintaining a backprojection, which is a counterpart to the forward projection that runs in the opposite direction. Backprojections were considered in Section 8.5.2 to keep track of the active states in a Dijkstra-like algorithm over continuous state spaces. In the current setting, backprojections need to address uncertainty.

Consider the case of nondeterministic uncertainty. Let a state x ∈ X be given. Under a fixed action u, what previous states, x′ ∈ X, could possibly lead to x? This depends only on the possible choices of nature and is expressed as

WB(x, u) = {x′ ∈ X | ∃θ ∈ Θ(x′, u) such that x = f(x′, u, θ)}. (10.20)

The notation WB(x, u) refers to the weak backprojection of x under u, and gives the set of all states from which x may possibly be reached in one stage.

The backprojection is called “weak” because it does not guarantee that x is reached, which is a stronger condition. By guaranteeing that x is reached, a strong backprojection of x under u is defined as

SB(x, u) = {x′ ∈ X | ∀θ ∈ Θ(x′, u), x = f(x′, u, θ)}. (10.21)

The difference between (10.20) and (10.21) is the quantifier: either there exists an action of nature that enables x to be reached, or x is reached for all actions of nature. Note that SB(x, u) ⊆ WB(x, u). In many cases, SB(x, u) = ∅, whereas WB(x, u) is rarely empty. The backprojection that was introduced in (8.66) of Section 8.5.2 did not involve uncertainty; hence, the distinction between weak and strong backprojections did not arise.

Two useful generalizations will now be made: 1) A backprojection can be defined from a set of states; 2) the action does not need to be fixed. Instead of a fixed state, x, consider a set S ⊆ X of states. What are the states from which an element of S could possibly be reached in one stage under the application of u? This is the weak backprojection of S under u:

WB(S, u) = {x′ ∈ X | ∃θ ∈ Θ(x′, u) such that f(x′, u, θ) ∈ S}, (10.22)

which can also be expressed as

WB(S, u) = ∪_{x ∈ S} WB(x, u). (10.23)

Similarly, the strong backprojection of S under u is defined as

SB(S, u) = {x′ ∈ X | ∀θ ∈ Θ(x′, u), f(x′, u, θ) ∈ S}. (10.24)

Note that SB(S, u) cannot be formed by the union of SB(x, u) over all x ∈ S. Another observation is that for each xk ∈ SB(S, uk), we have Xk+1(xk, uk) ⊆ S.

Now the dependency on u will be removed. This yields a backprojection of a set S. These are states from which there exists an action that possibly reaches S. The weak backprojection of S is

WB(S) = {x′ ∈ X | ∃u ∈ U(x′) such that x′ ∈ WB(S, u)}, (10.25)

and the strong backprojection of S is

SB(S) = {x′ ∈ X | ∃u ∈ U(x′) such that x′ ∈ SB(S, u)}. (10.26)

Note that SB(S) ⊆ WB(S).

Example 10.5 (Backprojections) Once again, consider the model from the first part of Example 10.1. The backprojection WB(0, 2) represents the set of all states from which u = 2 can be applied and x = 0 is possibly reached; the result is WB(0, 2) = {−3, −2, −1}. The state 0 cannot be reached with certainty from any state in WB(0, 2). Therefore, SB(0, 2) = ∅.

Now consider backprojections from the goal, XG = {−1, 0, 1}, under the action u = 2. The weak backprojection is

WB(XG, 2) = WB(−1, 2) ∪ WB(0, 2) ∪ WB(1, 2) = {−4, −3, −2, −1, 0}. (10.27)

The strong backprojection is SB(XG, 2) = {−2}. From any of the other states in WB(XG, 2), nature could cause the goal to be missed. Note that SB(XG, 2) cannot be constructed by taking the union of SB(x, 2) over every x ∈ XG.

Finally, consider backprojections that do not depend on an action. These are WB(XG) = {−4, −3, . . . , 4} and SB(XG) = XG. In the latter case, all states in XG lie in SB(XG) because uT can be applied. Without allowing uT , we would obtain SB(XG) = {−2, 2}.
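The weak and strong backprojections of this example can be verified directly from the definitions (10.20) and (10.24). The sketch below truncates the number line to a finite window, which is safe here because every relevant state lies well inside it:

```python
def WB(x_target, u, X, Theta, f):
    """Weak backprojection (10.20): states from which x_target is
    possibly reached in one stage under u."""
    return {xp for xp in X
            if any(f(xp, u, th) == x_target for th in Theta(xp, u))}

def SB(S, u, X, Theta, f):
    """Strong backprojection of a set (10.24): states from which S is
    guaranteed to be reached under u, whatever nature does."""
    return {xp for xp in X
            if all(f(xp, u, th) in S for th in Theta(xp, u))}

# The number-line model of Example 10.1, on a truncated window:
X = range(-10, 11)
Theta = lambda x, u: (-1, 0, 1)
f = lambda x, u, th: x + u + th

print(sorted(WB(0, 2, X, Theta, f)))            # [-3, -2, -1]
print(sorted(SB({-1, 0, 1}, 2, X, Theta, f)))   # [-2]
```

Taking the union of WB(x, 2) over x ∈ {−1, 0, 1} recovers the weak backprojection {−4, . . . , 0} of (10.27), while no such union recovers the strong backprojection.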

Other kinds of backprojections are possible, but we will not define them. One possibility is to make backprojections over multiple stages, as was done for forward projections. Another possibility is to define them for the probabilistic case. This is considerably more complicated. An example of a probabilistic backprojection is to find the set of all states from which a state in S will be reached with at least probability p.

10.1.3 A Plan and Its Execution

In Chapter 2, a plan was specified by a sequence of actions. This was possible because the effect of actions was completely predictable. Throughout most of Part II, a plan was specified as a path, which is a continuous-stage version of the action sequence. Section 8.2.1 introduced plans that are expressed as a function on the state space. This was optional because uncertainty was not explicitly modeled (except perhaps in the initial state).

As a result of unpredictability caused by nature, it is now important to separate the definition of a plan from its execution. The same plan may be executed many times from the same initial state; however, because of nature, different future states will be obtained. This requires the use of feedback in the form of a plan that maps states to actions.

Defining a plan Let a (feedback) plan for Formulation 10.1 be defined as a function π : X → U that produces an action π(x) ∈ U(x), for each x ∈ X. Although the future state may not be known due to nature, if π is given, then it will at least be known what action will be taken from any future state. In other works, π has been called a feedback policy, feedback control law, reactive plan [340], and conditional plan.

For some problems, particularly when K is fixed at some finite value, a stage-dependent plan may be necessary. This enables a different action to be chosen for every stage, even from the same state. Let K denote the set {1, . . . , K} of stages. A stage-dependent plan is defined as π : X × K → U. Thus, an action is given by u = π(x, k). Note that the definition of a K-step plan, which was given in Section 2.3, is a special case of the current definition. In that setting, the action depended only on the stage because future states were always predictable. Here they are no longer predictable and must be included in the domain of π. Unless otherwise mentioned, it will be assumed by default that π is not stage-dependent.

Note that once π is formulated, the state transitions appear to be a function of only the current state and nature. The next state is given by f(x, π(x), θ). The same is true for the cost term, l(x, π(x), θ).

Forward projections under a fixed plan Forward projections can now be defined under the constraint that a particular plan is executed. The specific expression of actions is replaced by π. Each time an action is needed from a state x ∈ X, it is obtained as π(x). In this formulation, a different U(x) may be used for each x ∈ X, assuming that π is correctly defined to use whatever actions are actually available in U(x) for each x ∈ X.

First we will consider the nondeterministic case. Suppose that the initial state x1 and a plan π are known. This means that u1 = π(x1), which can be substituted into (10.10) to compute the one-stage forward projection. To compute the two-stage forward projection, u2 is determined from π(x2) for use in (10.11). A recursive formulation of the nondeterministic forward projection under a fixed plan is

Xk+1(x1, π) = {xk+1 ∈ X | ∃θk ∈ Θ(xk, π(xk)) such that

xk ∈ Xk(x1, π) and xk+1 = f(xk, π(xk), θk)}.

(10.28)

The probabilistic forward projection in (10.13) can be adapted to use π, which results in

P(xk+2|xk, π) = Σ_{xk+1 ∈ X} P(xk+2|xk+1, π(xk+1)) P(xk+1|xk, π(xk)). (10.29)

The basic idea can be applied k − 1 times to compute P(xk|x1, π).

A state transition matrix can be used once again to express the probabilistic forward projection. In (10.15), all columns correspond to the application of the action u. Let Mπ be the state transition matrix that arises under a fixed plan π. Each column of Mπ may represent a different action because each column represents a different state xk. Each entry of Mπ is

mi,j = P(xk+1 = i | xk = j, π(xk)). (10.30)

The resulting Mπ defines a Markov process that is induced under the application of the plan π.
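Constructing Mπ per (10.30) is mechanical once π is fixed. The toy plan and the boundary clamping below are our own assumptions, chosen only so that every column remains a valid probability distribution:

```python
import numpy as np

# A toy plan on the states {0, 1, 2, 3, 4} (an assumption): command a
# step of +1 until state 3 is reached, then command "stay" (0).
n = 5
pi = lambda x: 1 if x < 3 else 0
Theta = (-1, 0, 1)                    # equally likely nature actions

# Build M_pi per (10.30); transitions are clamped at the boundary so
# that each column remains a probability distribution.
M_pi = np.zeros((n, n))
for j in range(n):
    for theta in Theta:
        i = min(max(j + pi(j) + theta, 0), n - 1)
        M_pi[i, j] += 1.0 / 3.0

print(M_pi.sum(axis=0))               # [1. 1. 1. 1. 1.]
```

Repeatedly multiplying a distribution by M_pi then simulates the Markov process that the plan induces.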

Graph representations of a plan The game against nature involves two decision makers: nature and the robot. Once the plan is formulated, the decisions of the robot become fixed, which leaves nature as the only remaining decision maker. Using this interpretation, a directed graph can be defined in the same way as in Section 2.1, except nature actions are used instead of the robot’s actions. It can even be imagined that nature itself faces a discrete feasible planning problem as in Formulation 2.1, in which Θ(x, π(x)) replaces U(x), and there is no goal set. Let Gπ denote a plan-based state transition graph, which arises under the constraint that π is executed. The vertex set of Gπ is X. A directed edge in Gπ exists from x to x′ if there exists some θ ∈ Θ(x, π(x)) such that x′ = f(x, π(x), θ). Thus, from each vertex in Gπ , the set of outgoing edges represents all possible transitions to next states, given that the action is applied according to π. In the case of probabilistic uncertainty, Gπ becomes a weighted graph in which each edge is assigned the probability P(x′|x, π(x)). In this case, Gπ corresponds to the graph representation commonly used to depict a Markov chain.

A nondeterministic forward projection can easily be derived from Gπ by following the edges outward from the current state. The outward edges lead to the states of the one-stage forward projection. The outward edges of these states lead to the two-stage forward projection, and so on. The probabilistic forward projection can also be derived from Gπ .

The cost of a feedback plan Consider the cost-to-go of executing a plan π from a state x1 ∈ X. The resulting cost depends on the sequences of states that are visited, actions that are applied by the plan, and the applied nature actions. In Chapter 2 this was obtained by adding the cost terms, but now there is a dependency on nature. Both worst-case and expected-case analyses are possible, which generalize the treatment of Section 9.2 to state spaces and multiple stages.

Let H(π, x1) denote the set of state-action-nature histories that could arise from π when applied using x1 as the initial state. The cost-to-go, Gπ (x1), under a given plan π from x1 can be measured using worst-case analysis as

Gπ (x1) = max_{(x˜,u˜,θ˜) ∈ H(π,x1)} L(x˜, u˜, θ˜), (10.31)

which is the maximum cost over all possible trajectories from x1 under the plan π. If any of these fail to terminate in the goal, then the cost becomes infinity. In (10.31), x˜, u˜, and θ˜ are infinite histories, although their influence on the cost is expected to terminate early due to the application of uT .

An optimal plan using worst-case analysis is any plan for which Gπ (x1) is minimized over all possible plans (all ways to assign actions to the states). In the case of feasible planning, there are usually numerous equivalent alternatives. Sometimes the task may be only to find a feasible plan, which means that all trajectories must reach the goal, but the cost does not need to be optimized.

Using probabilistic uncertainty, the cost of a plan can be measured using expected-case analysis as

Gπ (x1) = E_{H(π,x1)} [L(x˜, u˜, θ˜)], (10.32)

in which E denotes the mathematical expectation taken over H(π, x1) (i.e., the plan is evaluated in terms of a weighted sum, in which each term has a weight for the probability of a state-action-nature history and its associated cost, L(x˜, u˜, θ˜)). This can also be interpreted as the expected cost over trajectories from x1. If any of these have nonzero probability and fail to terminate in the goal, then Gπ (x1) = ∞. In the probabilistic setting, the task is usually to find a plan that minimizes Gπ (x1).

An interesting question now emerges: Can the same plan, π, be optimal from every initial state x1 ∈ X, or do we need to potentially find a different optimal plan for each initial state? Fortunately, a single plan will suffice to be optimal over all initial states. Why? This behavior was also observed in Section 8.2.1. If π is optimal from some x1, then it must also be optimal from every other state that is potentially visited by executing π from x1. Let x denote such a visited state. If π were not optimal from x, then a better plan would exist, and the goal could be reached from x with lower cost. This contradicts the optimality of π because solutions from x1 must travel through x. Let π∗ denote a plan that is optimal from every initial state.
