Algorithms for Computing Feedback Plans
- Value Iteration
Fortunately, the value iteration method of Section 2.3.1.1 extends nicely to handle uncertainty in prediction. This was the main reason why value iteration was introduced in Chapter 2. Value iteration was easier to describe in Section 2.3.1.1 because the complications of nature were avoided. In the current setting, value iteration retains most of its efficiency and can easily solve problems that involve thousands or even millions of states.
The state space, X, is assumed to be finite throughout Section 10.2.1. An extension to the case of a countably infinite state space can be developed if cost-to-go values over the entire space do not need to be computed incrementally.
Only backward value iteration is considered here. Forward versions can alternatively be defined.
Nondeterministic case Suppose that the nondeterministic model of nature is used. A dynamic programming recurrence, (10.39), will be derived. This directly yields an iterative approach that computes a plan that minimizes the worst-case cost. The following presentation shadows that of Section 2.3.1.1; therefore, it may be helpful to refer back to this periodically.
An optimal plan π∗ will be found by computing optimal cost-to-go functions. For 1 ≤ k ≤ F, let G∗k denote the worst-case cost that could accumulate from stage k to F under the execution of the optimal plan (compare to (2.5)):

G^*_k(x_k) = \min_{u_k} \max_{\theta_k} \min_{u_{k+1}} \max_{\theta_{k+1}} \cdots \min_{u_K} \max_{\theta_K} \left\{ \sum_{i=k}^{K} l(x_i, u_i, \theta_i) + l_F(x_F) \right\}.    (10.33)
Inside of the min’s and max’s of (10.33) are the last F − k terms of the cost
functional, (10.8). For simplicity, the ranges of each ui and θi in the min’s and max’s of (10.33) have not been indicated. The optimal cost-to-go for k = F is
G^*_F(x_F) = l_F(x_F),    (10.34)
which is the same as (2.6) for the predictable case.
Now consider making K passes over X, each time computing G∗k from G∗k+1, as k ranges from F down to 1. In the first iteration, G∗F is copied from lF . In the
second iteration, G∗K is computed for each xK ∈ X as (compare to (2.7))
G^*_K(x_K) = \min_{u_K} \max_{\theta_K} \Big\{ l(x_K, u_K, \theta_K) + l_F(x_F) \Big\},    (10.35)

in which u_K ∈ U(x_K) and θ_K ∈ Θ(x_K, u_K). Since l_F = G∗F and x_F = f(x_K, u_K, θ_K), substitutions are made into (10.35) to obtain (compare to (2.8))

G^*_K(x_K) = \min_{u_K} \max_{\theta_K} \Big\{ l(x_K, u_K, \theta_K) + G^*_F(f(x_K, u_K, \theta_K)) \Big\},    (10.36)
which computes the costs of all optimal one-step plans from stage K to stage
F = K + 1.
More generally, Gk∗ can be computed once G∗k+1 is given. Carefully study
(10.33), and note that it can be written as (compare to (2.9))
G^*_k(x_k) = \min_{u_k} \max_{\theta_k} \bigg\{ \min_{u_{k+1}} \max_{\theta_{k+1}} \cdots \min_{u_K} \max_{\theta_K} \bigg\{ l(x_k, u_k, \theta_k) + \sum_{i=k+1}^{K} l(x_i, u_i, \theta_i) + l_F(x_F) \bigg\} \bigg\}    (10.37)
by pulling the first cost term out of the sum and by separating the minimization over uk from the rest, which range from uk+1 to uK . The second min and max do not affect the l(xk, uk, θk) term; thus, l(xk, uk, θk) can be pulled outside to obtain (compare to (2.10))
G^*_k(x_k) = \min_{u_k} \max_{\theta_k} \bigg\{ l(x_k, u_k, \theta_k) + \min_{u_{k+1}} \max_{\theta_{k+1}} \cdots \min_{u_K} \max_{\theta_K} \bigg\{ \sum_{i=k+1}^{K} l(x_i, u_i, \theta_i) + l_F(x_F) \bigg\} \bigg\}.    (10.38)
The inner min’s and max’s represent G∗k+1, which yields the recurrence (compare to (2.11))
G^*_k(x_k) = \min_{u_k \in U(x_k)} \Big\{ \max_{\theta_k \in \Theta(x_k, u_k)} \Big\{ l(x_k, u_k, \theta_k) + G^*_{k+1}(x_{k+1}) \Big\} \Big\}.    (10.39)
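To make the recurrence concrete, the sketch below performs the backward value iterations implied by (10.34) and (10.39), with x_{k+1} = f(x_k, u_k, θ_k). It assumes the problem data are supplied as Python callables; the names X, U, Theta, f, l, and L_F are illustrative encodings, not part of the formulation.

```python
import math

def nondeterministic_value_iteration(X, U, Theta, f, l, L_F, K):
    """Backward value iterations under the nondeterministic (worst-case)
    model, following (10.39); the final stage is F = K + 1."""
    G = {x: L_F(x) for x in X}              # boundary condition (10.34)
    for _ in range(K):                       # passes for k = K, K-1, ..., 1
        G_next = {}
        for x in X:
            best = math.inf
            for u in U(x):                   # min over u_k in U(x_k)
                worst = max(l(x, u, th) + G[f(x, u, th)]   # max over theta_k
                            for th in Theta(x, u))
                best = min(best, worst)
            G_next[x] = best
        G = G_next
    return G
```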
Probabilistic case Now consider the probabilistic case. A value iteration method can be obtained by once again shadowing the presentation in Section 2.3.1.1. For k from 1 to F, let G∗k denote the expected cost from stage k to F under the execution of the optimal plan (compare to (2.5)):
G^*_k(x_k) = \min_{u_k, \ldots, u_K} \bigg\{ E_{\theta_k, \ldots, \theta_K}\bigg[ \sum_{i=k}^{K} l(x_i, u_i, \theta_i) + l_F(x_F) \bigg] \bigg\}.    (10.40)
The optimal cost-to-go for the boundary condition of k = F again reduces to (10.34).
Once again, the algorithm makes K passes over X, each time computing G∗k
from G∗k+1, as k ranges from F down to 1. As before, G∗F is copied from lF . In the second iteration, G∗K is computed for each xK ∈ X as (compare to (2.7))
G^*_K(x_K) = \min_{u_K} \Big\{ E_{\theta_K}\big[ l(x_K, u_K, \theta_K) + l_F(x_F) \big] \Big\},    (10.41)

in which u_K ∈ U(x_K) and the expectation occurs over θ_K. Substitutions are made into (10.41) to obtain (compare to (2.8))

G^*_K(x_K) = \min_{u_K} \Big\{ E_{\theta_K}\big[ l(x_K, u_K, \theta_K) + G^*_F(f(x_K, u_K, \theta_K)) \big] \Big\}.    (10.42)
The general iteration is
G^*_k(x_k) = \min_{u_k} \bigg\{ E_{\theta_k}\bigg[ \min_{u_{k+1}, \ldots, u_K} \bigg\{ E_{\theta_{k+1}, \ldots, \theta_K}\bigg[ l(x_k, u_k, \theta_k) + \sum_{i=k+1}^{K} l(x_i, u_i, \theta_i) + l_F(x_F) \bigg] \bigg\} \bigg] \bigg\},    (10.43)
which is obtained once again by pulling the first cost term out of the sum and by separating the minimization over uk from the rest. The second min and expectation do not affect the l(xk, uk, θk) term, which is pulled outside to obtain (compare to (2.10))
G^*_k(x_k) = \min_{u_k} \bigg\{ E_{\theta_k}\bigg[ l(x_k, u_k, \theta_k) + \min_{u_{k+1}, \ldots, u_K} \bigg\{ E_{\theta_{k+1}, \ldots, \theta_K}\bigg[ \sum_{i=k+1}^{K} l(x_i, u_i, \theta_i) + l_F(x_F) \bigg] \bigg\} \bigg] \bigg\}.    (10.44)
The inner min and expectation define G∗k+1, yielding the recurrence (compare to (2.11) and (10.39))
G^*_k(x_k) = \min_{u_k \in U(x_k)} \Big\{ E_{\theta_k}\big[ l(x_k, u_k, \theta_k) + G^*_{k+1}(x_{k+1}) \big] \Big\}
           = \min_{u_k \in U(x_k)} \bigg\{ \sum_{\theta_k \in \Theta(x_k, u_k)} \big( l(x_k, u_k, \theta_k) + G^*_{k+1}(f(x_k, u_k, \theta_k)) \big) P(\theta_k \mid x_k, u_k) \bigg\}.    (10.45)
If the cost term does not depend on θk, it can be written as l(xk, uk), and (10.45) simplifies to
G^*_k(x_k) = \min_{u_k \in U(x_k)} \bigg\{ l(x_k, u_k) + \sum_{x_{k+1} \in X} G^*_{k+1}(x_{k+1}) P(x_{k+1} \mid x_k, u_k) \bigg\}.    (10.46)

The dependency of state transitions on θk is implicit through the expression of P(xk+1 | xk, uk), for which the definition uses P(θk | xk, uk) and the state transition equation f. The form given in (10.46) may be more convenient than (10.45) in implementations.
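A corresponding sketch for the probabilistic case, in the form (10.46), is given below. It assumes that the derived transition probabilities P(x_{k+1} | x_k, u_k) have already been computed from P(θ_k | x_k, u_k) and f, that goal states keep cost-to-go zero (a termination action is implicit), and that the iterations stop once the maximum change falls below a tolerance; whether this happens after finitely many iterations or only asymptotically is the subject of the convergence issues discussed next. All names and data structures here are illustrative assumptions.

```python
import math

def probabilistic_value_iteration(X, U, l, P, XG, tol=1e-9, max_iters=100000):
    """Stationary value iterations in the form (10.46).  P[(x, u)] is a dict
    mapping next states to probabilities P(x_{k+1} | x_k, u_k)."""
    G = {x: (0.0 if x in XG else math.inf) for x in X}
    for _ in range(max_iters):
        G_next, change = {}, 0.0
        for x in X:
            if x in XG:
                G_next[x] = 0.0              # cost-to-go fixed at the goal
                continue
            G_next[x] = min((l(x, u) + sum(p * G[xn]
                                           for xn, p in P[(x, u)].items())
                             for u in U(x)), default=math.inf)
            if G_next[x] != G[x]:
                diff = abs(G_next[x] - G[x]) if math.isfinite(G[x]) else math.inf
                change = max(change, diff)
        G = G_next
        if change < tol:                      # threshold-based termination
            break
    return G
```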
Convergence issues If the maximum number of stages is fixed in the problem definition, then convergence is assured. Suppose, however, that there is no limit on the number of stages. Recall from Section 2.3.2 that each value iteration increases the total path length by one. The actual stage indices were not important in backward dynamic programming because arbitrary shifting of indices does not affect the values. Eventually, the algorithm terminated because optimal cost-to-go values had been computed for all reachable states from the goal. This resulted in a stationary cost-to-go function because the values no longer changed. States that are reachable from the goal converged to finite values, and the rest remained at infinity. The only problem that prevents the existence of a stationary cost-to-go function, as mentioned in Section 2.3.2, is negative cycles in the graph. In this case, the best plan would be to loop around the cycle forever, which would reduce the cost to −∞.
In the current setting, a stationary cost-to-go function once again arises, but cycles once again cause difficulty in convergence. The situation is, however, more
complicated due to the influence of nature. It is helpful to consider a plan-based state transition graph, Gπ . First consider the nondeterministic case. If there
exists a plan π from some state x1 for which all possible actions of nature cause the traversal of cycles that accumulate negative cost, then the optimal cost-to-go at x1 converges to −∞, which prevents the value iterations from terminating. These cases can be detected in advance, and each such initial state can be avoided (some may even be in a different connected component of the state space).
It is also possible that there are unavoidable positive cycles. In Section 2.3.2, the cost-to-go function behaved differently depending on whether the goal set was
Figure 10.2: Plan-based state transition graphs. (a) The goal is possibly reachable, but not guaranteed reachable because an infinite cycle could occur. (b) The goal is guaranteed reachable because all flows lead to the goal.
reachable. Due to nature, the goal set may be possibly reachable or guaranteed reachable, as illustrated in Figure 10.2. To be possibly reachable from some initial state, there must exist a plan, π, for which there exists a sequence of nature actions that will lead the state into the goal set. To be guaranteed reachable, the goal must be reached in spite of all possible sequences of nature actions. If the goal is possibly reachable, but not guaranteed reachable, from some state x1 and all edges have positive cost, then the cost-to-go value of x1 tends to infinity as the value iterations are repeated. For example, every plan-based state transition graph may contain a cycle of positive cost, and in the worst case, nature may cause the state to cycle indefinitely. If convergence of the value iterations is only evaluated at states from which the goal set is guaranteed to be reachable, and if there are no negative cycles, then the algorithm should terminate when all cost-to-go values remain unchanged.
For the probabilistic case, there are three situations:
- The value iterations arrive at a stationary cost-to-go function after a finite number of iterations.
- The value iterations do not converge in any sense.
- The value iterations converge only asymptotically to a stationary cost-to-go function. The number of iterations tends to infinity as the values converge.
The first two situations have already occurred. The first one occurs if there exists a plan, π, for which Gπ has no cycles. The second situation occurs if there are negative or positive cycles for which all edges in the cycle have probability one. This situation is essentially equivalent to that for the nondeterministic case. Worst-case analysis assumes that the worst possible nature actions will be applied. For the probabilistic case, the nature actions are forced by setting their probabilities to one.
The third situation is unique to the probabilistic setting. This is caused by positive or negative cycles in Gπ for which the edges have probabilities in (0, 1). The optimal plan may even have such cycles. As the value iterations consider
Figure 10.3: A plan-based state transition graph that causes asymptotic conver- gence. The probabilities of the transitions are shown on the edges. Longer and longer paths exist by traversing the cycle, but the probabilities become smaller.
longer and longer paths, a cycle may be traversed more times. However, each time the cycle is traversed, the probability diminishes. The probabilities diminish exponentially in terms of the number of stages, whereas the costs only accumulate linearly. The changes in the cost-to-go function gradually decrease and converge only to stationary values as the number of iterations tends to infinity. If some approximation error is acceptable, then the iterations can be terminated once the maximum change over all of X is within some ε threshold. The required number of value iterations to obtain a solution of the desired quality depends on the probabilities of following the cycles and on their costs. If the probabilities are lower, then the algorithm converges sooner.
Example 10.6 (A Cycle in the Transition Graph) Suppose that a plan, π, is chosen that yields the plan-based state transition graph shown in Figure 10.3. A probabilistic model is used, and the probabilities are shown on each edge. For simplicity, assume that each transition results in unit cost, l(x, u, θ) = 1, over all x, u, and θ.
The expected cost from xI is straightforward to compute. With probability 1/2, the cost to reach xG is 3. With probability 1/4, the cost is 7. With probability 1/8, the cost is 11. Each time another cycle is taken, the cost increases by 4, but the probability is cut in half. This leads to the infinite series
G_\pi(x_I) = 3 + 4 \sum_{i=1}^{\infty} \frac{1}{2^i} = 7.    (10.47)
The infinite sum is the standard geometric series and converges to 1; hence (10.47) converges to 7.
Even though the cost converges to a finite value, this only occurs in the limit. An infinite number of value iterations would theoretically be required to obtain this result. For most applications, an approximate solution suffices, and very high precision can be obtained with a small number of iterations (e.g., after 20 iterations, the change is on the order of one-billionth). Thus, in general, it is sensible to terminate the value iterations after the maximum cost-to-go change is less than a threshold based directly on precision.
Note that if nondeterministic uncertainty is used, then the value iterations do not converge because, in the worst case, nature will cause the state to cycle forever. Even though the goal is not guaranteed reachable, the probabilistic uncertainty model allows reasonable solutions.
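As a quick check of (10.47), the expectation can be truncated and summed numerically; the snippet below only illustrates the arithmetic of Example 10.6 and is not part of the text.

```python
# Truncated numerical check of (10.47): with probability (1/2)^(i+1) the
# execution traverses the cycle i times and accumulates cost 3 + 4i.
expected = sum((3 + 4 * i) * 0.5 ** (i + 1) for i in range(200))
print(expected)   # approximately 7.0, matching G_pi(xI) = 7
```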
Using the plan Assume that there is no limit on the number of stages. After the value iterations terminate, cost-to-go functions are determined over X. This is not exactly a plan, because an action is required for each x ∈ X. The actions can be obtained by recording the u ∈ U(x) that produced the minimum cost value in (10.45) or (10.39).
Assume that the value iterations have converged to a stationary cost-to-go function. Before uncertainty was introduced, the optimal actions were determined by (2.19). The nondeterministic and probabilistic versions of (2.19) are
\pi^*(x) = \operatorname*{argmin}_{u \in U(x)} \Big\{ \max_{\theta \in \Theta(x, u)} \big\{ l(x, u, \theta) + G^*(f(x, u, \theta)) \big\} \Big\}    (10.48)

and

\pi^*(x) = \operatorname*{argmin}_{u \in U(x)} \Big\{ E_{\theta}\big[ l(x, u, \theta) + G^*(f(x, u, \theta)) \big] \Big\},    (10.49)
respectively. For each x ∈ X at which the optimal cost-to-go value is known, one evaluation of (10.45) yields the best action.
Conveniently, the optimal action can be recovered directly during execution of the plan, rather than storing actions. Each time a state xk is obtained during execution, the appropriate action uk = π∗(xk) is selected by evaluating (10.48) or (10.49) at xk. This means that the cost-to-go function itself can be interpreted as a representation of the optimal plan, once it is understood that a local operator is required to recover the action. It may seem strange that such a local computation yields the global optimum; however, this works because the cost-to-go function already encodes the global costs. This behavior was also observed for continuous state spaces in Section 8.4.1, in which a navigation function served to define a feedback motion plan. In that context, a gradient operator was needed to recover the direction of motion. In the current setting, (10.48) and (10.49) serve the same purpose.
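In code, this local operator is just a one-step lookahead over the stationary cost-to-go; the sketch below mirrors (10.48) and (10.49). The callables U, Theta, f, l, and P_theta, and the dictionary G holding G∗, are hypothetical problem data, not defined in the text.

```python
def best_action_worst_case(x, U, Theta, f, l, G):
    """One evaluation of (10.48): minimize the worst-case one-step cost plus
    the stationary cost-to-go G* (stored in the dictionary G)."""
    return min(U(x), key=lambda u: max(l(x, u, th) + G[f(x, u, th)]
                                       for th in Theta(x, u)))

def best_action_expected(x, U, Theta, f, l, P_theta, G):
    """One evaluation of (10.49): minimize the expected one-step cost plus
    the stationary cost-to-go; P_theta(th, x, u) gives P(theta | x, u)."""
    return min(U(x), key=lambda u: sum(P_theta(th, x, u) *
                                       (l(x, u, th) + G[f(x, u, th)])
                                       for th in Theta(x, u)))
```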
- Policy Iteration
The value iterations of Section 10.2.1 work by iteratively updating cost-to-go values on the state space. The optimal plan can alternatively be obtained by iteratively searching in the space of plans. This leads to a method called policy iteration [84]; the term policy is synonymous with plan. The method will be explained for the case of probabilistic uncertainty, and it is assumed that X is finite. With minor adaptations, a version for nondeterministic uncertainty can also be developed.
Policy iteration repeatedly requires computing the cost-to-go for a given plan, π. Recall the definition of Gπ from (10.32). First suppose that there are no uncertainties, and that the state transition equation is x′ = f(x, u). The dynamic programming equation (2.18) from Section 2.3.2 can be used to derive the cost-
to-go for each state x ∈ X under the application of π. Make a copy of (2.18) for each x ∈ X, and instead of the min, use the given action u = π(x), to yield
Gπ (x) = l(x, π(x)) + Gπ (f(x, π(x))). (10.50)
In (10.50), the G∗ has been replaced by Gπ because there are no variables to optimize (it is simply the cost of applying π). Equation (10.50) can be thought of as a trivial form of dynamic programming in which the choice of possible plans has been restricted to a single plan, π. If the dynamic programming recurrence (2.18) holds over the space of all plans, it must certainly hold over a space that consists of a single plan; this is reflected in (10.50).
If there are n states, (10.50) yields n equations, each of which gives an expres- sion of Gπ (x) for a different state. For the states in which x ∈ XG, it is known that
Gπ (x) = 0. Now that this is known, the cost-to-go for all states from which XG can be reached in one stage can be computed using (10.50) with Gπ (f(x, π(x))) = 0. Once these cost-to-go values are computed, another wave of values can be com- puted from states that can reach these in one stage. This process continues until the cost-to-go values are computed for all states. This is similar to the behavior of Dijkstra’s algorithm.
This process of determining the cost-to-go should not seem too mysterious. Equation (10.50) indicates how the costs differ between neighboring states in the state transition graph. Since all of the differences are specified and an initial condition is given for XG, all others can be derived by adding up the differences expressed in (10.50). Similar ideas appear in the Hamilton-Jacobi-Bellman equa- tion and Pontryagin’s minimum principle, which are covered in Section 15.2.
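A sketch of this outward propagation for the deterministic case of (10.50) is given below; it simply sweeps over the states, assigning Gπ whenever the successor's value is already known. The names X, f, l, pi, and XG are illustrative, and states that never reach the goal under π are simply left unassigned.

```python
def evaluate_plan_deterministic(X, f, l, pi, XG):
    """Propagate (10.50) outward from the goal: G_pi(x) is assigned as soon
    as the cost-to-go of the successor f(x, pi(x)) is known."""
    G = {x: 0.0 for x in XG}
    remaining = set(X) - set(XG)
    while remaining:
        progressed = False
        for x in list(remaining):
            x_next = f(x, pi(x))
            if x_next in G:                       # successor value is known
                G[x] = l(x, pi(x)) + G[x_next]
                remaining.discard(x)
                progressed = True
        if not progressed:                        # remaining states never
            break                                 # reach the goal under pi
    return G
```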
Now we turn to the case in which there are probabilistic uncertainties. The probabilistic analog of (2.18) is (10.49). For simplicity, consider the special case in which l(x, u, θ) does not depend on θ, which results in
\pi^*(x) = \operatorname*{argmin}_{u \in U(x)} \bigg\{ l(x, u) + \sum_{x' \in X} G^*(x') P(x' \mid x, u) \bigg\},    (10.51)

in which x′ = f(x, u). The cost-to-go function, G∗, satisfies the dynamic programming recurrence

G^*(x) = \min_{u \in U(x)} \bigg\{ l(x, u) + \sum_{x' \in X} G^*(x') P(x' \mid x, u) \bigg\}.    (10.52)
The probabilistic analog to (10.50) can be made from (10.52) by restricting the set of actions to a single plan, π, to obtain
G_\pi(x) = l(x, \pi(x)) + \sum_{x' \in X} G_\pi(x') P(x' \mid x, \pi(x)),    (10.53)
POLICY ITERATION ALGORITHM
- Pick an initial plan π, in which uT is applied at each x ∈ XG and all other actions are chosen arbitrarily.
- Use (10.53) to compute Gπ for each x ∈ X under the plan π.
- Substituting the computed Gπ values for G∗, use (10.51) to compute a better plan, π′:
\pi'(x) = \operatorname*{argmin}_{u \in U(x)} \bigg\{ l(x, u) + \sum_{x' \in X} G_\pi(x') P(x' \mid x, u) \bigg\}.    (10.54)
- If π′ produces at least one lower cost-to-go value than π, then let π = π′ and repeat Steps 2 and 3. Otherwise, declare π to be the optimal plan, π∗.
Figure 10.4: The policy iteration algorithm iteratively searches the space of plans by evaluating and improving plans.
in which x′ is the next state.
The cost-to-go for each x ∈ X under the application of π can be determined by
writing (10.53) for each state. Note that all quantities except Gπ are known. This means that if there are n states, then there are n linear equations and n unknowns (Gπ (x) for each x ∈ X). The same was true when (10.50) was used, except the equations were much simpler. In the probabilistic setting, a system of n linear equations must be solved to determine Gπ . This may be performed using classical linear algebra techniques, such as singular value decomposition (SVD) [399, 961].

Now that we have a method for evaluating the cost of a plan, the policy iteration method is straightforward, as specified in Figure 10.4. Note that in Step 3, the cost-to-go Gπ , which was developed for one plan, π, is used to evaluate other plans. The result is the cost that will be obtained if a new action is tried in the first stage and then π is used for all remaining stages. If a new action cannot reduce the cost, then π must have already been optimal because it means that (10.54) has become equivalent to the stationary dynamic programming equation, (10.49). If it is possible to improve π, then a new plan is obtained. The new plan must be strictly better than the previous plan, and there is only a finite number of possible plans in total. Therefore, the policy iteration method converges after a finite number of iterations.
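The following sketch assembles the steps of Figure 10.4, with the policy-evaluation step posed as the linear system described above. For brevity it indexes states by integers, assumes every action is applicable from every state and that the goal is reachable under every plan (so the system is nonsingular), and uses numpy's generic linear solver rather than the SVD mentioned above; all of these choices are illustrative assumptions.

```python
import numpy as np

def policy_iteration(n, actions, P, l, goal):
    """Sketch of Figure 10.4: states are 0..n-1, P[u] is an n-by-n matrix of
    transition probabilities for action u, l[(x, u)] is the stage cost, and
    goal is a set of goal states (cost-to-go fixed at zero there)."""
    pi = {x: actions[0] for x in range(n)}        # Step 1: arbitrary initial plan
    while True:
        # Step 2: evaluate pi by solving the n linear equations (10.53).
        A = np.eye(n)
        b = np.zeros(n)
        for x in range(n):
            if x in goal:
                continue                           # G_pi(x) = 0 at the goal
            b[x] = l[(x, pi[x])]
            for xn in range(n):
                if xn not in goal:
                    A[x, xn] -= P[pi[x]][x, xn]
        G = np.linalg.solve(A, b)
        # Step 3: attempt to improve the plan using (10.54).
        improved = False
        for x in range(n):
            if x in goal:
                continue
            costs = {u: l[(x, u)] + float(P[u][x] @ G) for u in actions}
            best_u = min(costs, key=costs.get)
            if costs[best_u] < costs[pi[x]] - 1e-12:
                pi[x], improved = best_u, True
        if not improved:                           # Step 4: pi is optimal
            return pi, G
```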
Example 10.7 (An Illustration of Policy Iteration) A simple example will now be used to illustrate policy iteration. Let X = {a, b, c} and U = {1, 2, uT }.
Let XG = {c}. Let l(x, u) = 1 for all x ∈ X and u ∈ U \ {uT } (if uT is applied, there is no cost). The probabilistic state transition graphs for each action are shown in Figure 10.5. The first step is to pick an initial plan. Let π(a) = 1 and π(b) = 1; let π(c) = uT because c ∈ XG.
Figure 10.5: The probabilistic state transition graphs for u = 1 and u = 2. Transitions out of c are not shown because it is assumed that a termination action is always applied from xG.
Now use (10.53) to compute Gπ . This yields three equations:
Gπ (a) = 1 + Gπ (a)P(a | a, 1) + Gπ (b)P(b | a, 1) + Gπ (c)P(c | a, 1) (10.55)
Gπ (b) = 1 + Gπ (a)P(a | b, 1) + Gπ (b)P(b | b, 1) + Gπ (c)P(c | b, 1) (10.56)
Gπ (c) = 0 + Gπ (a)P(a | c, uT ) + Gπ (b)P(b | c, uT ) + Gπ (c)P(c | c, uT ). (10.57)
Each equation represents a different state and uses the appropriate action from π. The final equation reduces to Gπ (c) = 0 because of the basic rules of applying a termination condition. After substituting values for P(x′ | x, u) and using Gπ (c) = 0, the other two equations become
G_\pi(a) = 1 + \tfrac{1}{3} G_\pi(a) + \tfrac{1}{3} G_\pi(b)    (10.58)

and

G_\pi(b) = 1 + \tfrac{1}{3} G_\pi(a) + \tfrac{1}{3} G_\pi(b).    (10.59)
The solutions are Gπ (a) = Gπ (b) = 3.
Now use (10.54) for each state with Gπ (a) = Gπ (b) = 3 and Gπ (c) = 0 to find a better plan, π′. At state a, it is found by solving
\pi'(a) = \operatorname*{argmin}_{u \in U} \bigg\{ l(a, u) + \sum_{x' \in X} G_\pi(x') P(x' \mid a, u) \bigg\}.    (10.60)
The best action is u = 2, which produces cost 5/2 and is computed as

l(a, 2) + \sum_{x' \in X} G_\pi(x') P(x' \mid a, 2) = 1 + (3)\tfrac{1}{2} + (0)\tfrac{1}{2} = \tfrac{5}{2}.    (10.61)
Thus, π′(a) = 2. Similarly, π′(b) = 2 can be computed, which produces cost 7/4. Once again, π′(c) = uT, which completes the determination of an improved plan. Since an improved plan has been found, replace π with π′ and return to Step 2. The new plan yields the equations
G_\pi(a) = 1 + \tfrac{1}{2} G_\pi(b)    (10.62)

and

G_\pi(b) = 1 + \tfrac{1}{4} G_\pi(a).    (10.63)
Solving these yields Gπ (a) = 12/7 and Gπ (b) = 10/7. The next step attempts to find a better plan using (10.54), but it is determined that the current plan cannot be improved. The policy iteration method terminates by correctly reporting that
π∗ = π.
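The two policy-evaluation steps in this example amount to solving 2×2 linear systems; they can be verified directly. The check below is not part of the text, only a transcription of (10.58)-(10.59) and (10.62)-(10.63).

```python
import numpy as np

# First plan (u = 1 at a and b), from (10.58)-(10.59):
#   G(a) = 1 + G(a)/3 + G(b)/3,   G(b) = 1 + G(a)/3 + G(b)/3.
A1 = np.array([[1 - 1/3, -1/3],
               [-1/3, 1 - 1/3]])
print(np.linalg.solve(A1, np.ones(2)))    # -> [3. 3.]

# Improved plan (u = 2 at a and b), from (10.62)-(10.63):
#   G(a) = 1 + G(b)/2,            G(b) = 1 + G(a)/4.
A2 = np.array([[1.0, -0.5],
               [-0.25, 1.0]])
print(np.linalg.solve(A2, np.ones(2)))    # -> [1.714..., 1.428...] = [12/7, 10/7]
```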
Policy iteration may appear preferable to value iteration, especially because it usually converges in fewer iterations than value iteration. The equation solving that determines the cost of a plan effectively considers multiple stages at once. However, for most planning problems, X is large and the large linear system of equations that must be solved at every iteration can become unwieldy. In some applications, either the state space may be small enough or sparse matrix techniques may allow efficient solutions over larger state spaces. In general, value-based methods seem preferable for most planning problems.
- Graph Search Methods
Value iteration is quite general; however, in many instances, most of the time is wasted on states that do not update their values because either the optimal cost-to-go is already known or the goal is not yet reached. Policy iteration seems to alleviate this problem, but it is limited to small state spaces. These shortcomings motivate the consideration of alternatives, such as extending the graph search methods of Section 2.2. In some cases, Dijkstra’s algorithm can even be extended to quickly obtain optimal solutions, but a strong assumption is required on the structure of solutions. In the nondeterministic setting, search methods can be developed that produce only feasible solutions, without regard for optimality. For the methods in this section, X need not be finite, as long as the search method is systematic, in the sense defined in Section 2.2.
Backward search with backprojections A backward search can be conducted by incrementally growing a plan outward from XG by using backprojections. A complete algorithm for computing feasible plans under nondeterministic uncertainty is outlined in Figure 10.6. Let S denote the set of states for which the plan has been computed. Initially, S = XG and, if possible, S may grow until S = X. The plan definition starts with π(x) = uT for each x ∈ XG and is incrementally extended to new states during execution.
Step 2 takes every state x that is not already in S and checks whether it should be added. This requires determining whether some action, u, can be applied from x, with the next state guaranteed to lie in S, as shown in Figure 10.7. If so, then π(x) = u is assigned and S is extended to include x. If no such progress can be made, then the algorithm must terminate. Otherwise, every state is checked again
BACKPROJECTION ALGORITHM
- Initialize S = XG, and let π(x) = uT for each x ∈ XG.
- For each x ∈ X \ S, if there exists some u ∈ U(x) such that x ∈ SB(S, u) then: 1) let π(x) = u, and 2) insert x into S.
- If Step 2 failed to extend S, then exit. This implies that SB(S) = S, which means no more progress can be made. Otherwise, go to Step 2.
Figure 10.6: A general algorithm for computing a feasible plan under nondeter- ministic uncertainty.
Figure 10.7: A state x can be added to S if there exists an action u ∈ U(x) such that the one-stage forward projection is contained in S.
by returning to Step 2. This is necessary because S has grown, and in the next iteration new states may lie in its strong backprojection.
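A compact sketch of the algorithm in Figure 10.6 is given below. F(x, u) stands for the one-stage forward projection {f(x, u, θ) : θ ∈ Θ(x, u)}, returned as a Python set; X, U, F, and XG are assumed problem data, and the termination action is represented by the placeholder string 'u_T'.

```python
def backprojection_plan(X, U, F, XG):
    """Sketch of Figure 10.6.  F(x, u) returns the one-stage forward
    projection as a set; x joins S when F(x, u) is a subset of S, i.e. when
    x lies in the strong backprojection SB(S, u)."""
    S = set(XG)
    pi = {x: 'u_T' for x in XG}                 # Step 1: terminate at the goal
    changed = True
    while changed:                               # repeat Step 2 while S grows
        changed = False
        for x in set(X) - S:
            for u in U(x):
                if F(x, u) <= S:                 # guaranteed to stay inside S
                    pi[x] = u
                    S.add(x)
                    changed = True
                    break
    return pi, S
```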
For efficiency reasons, the X \ S set in Step 2 may be safely replaced with the smaller set, WB(S) \ S, because it is impossible for other states in X to be affected. Depending on the problem, this condition may provide a quick way to prune many hopeless states from consideration. As an example, consider a grid-like environment in which a maximum of two steps in any direction is possible at a given time. A simple distance test can be implemented to eliminate many states from possible inclusion into S in Step 2.
As long as the consideration of states to include in S is systematic, as considered in Section 2.2, numerous variations of the algorithm in Figure 10.6 are possible. One possibility is to keep track of the cost-to-go and grow S based on incrementally inserting minimal-cost states. This leads to a nondeterministic version of Dijkstra’s algorithm, which is covered next.
Nondeterministic Dijkstra Figure 10.8 shows an extension of Dijkstra’s algorithm for solving the problem of Formulation 10.1 under nondeterministic uncertainty. It can also be considered as a variant of the algorithm in Figure 10.6
NONDETERMINISTIC DIJKSTRA
- Initialize S = ∅ and A = XG. Associate uT with every x ∈ A. Assign
G(x) = 0 for all x ∈ A and G(x) = ∞ for all other states.
- Unless A is empty, remove the xs ∈ A and its corresponding u, for which G is smallest. If A was empty, then exit (no further progress is possible).
- Designate π∗(xs) = u as part of the optimal plan and insert xs into S. Declare G∗(xs) = G(xs).
- Compute G(x) using (10.64) for any x in the frontier set, Front(xs, S), and insert the states of Front(xs, S) into A with their associated actions. For states already in A, retain whichever G value is lower, either its original value or the new computed value. Go to Step 2.
Figure 10.8: A Dijkstra-based algorithm for computing an optimal feasible plan under nondeterministic uncertainty.
because it grows S by using backprojections. The algorithm in Figure 10.8 represents a backward-search version of Dijkstra’s algorithm; therefore, it maintains the worst-case cost-to-go, G, which sometimes becomes the optimal, worst-case cost-to-go, G∗. Initially, G = 0 for states in the goal, and G = ∞ for all others.
Step 1 performs the initialization. Step 2 selects the state in A that has the smallest value. As in Dijkstra’s algorithm for deterministic problems, it is known that the cost-to-go for this state is the smallest possible. It is therefore declared in Step 3 that G∗(xs) = G(xs), and π∗ is extended to include xs.
Step 4 updates the costs for some states and expands the active set, A. Which costs could be immediately affected by the insertion of xs into S? These are
states xk ∈ X \ S for which there exists some uk ∈ U(xk) that produces a one-
stage forward projection, Xk+1(xk, uk), such that: 1) xs ∈ Xk+1(xk, uk) and 2)
Xk+1(xk, uk) ⊆ S. This is depicted in Figure 10.9. Let the set of states that
satisfy these constraints be called the frontier set, denoted by Front(xs, S). For each x ∈ Front(xs, S), let Uf (x) ⊆ U(x) denote the set of all actions for which the forward projection satisfies the two previous conditions.
The frontier set can be interpreted in terms of backprojections. The weak backprojection WB(xs) yields all states that can possibly reach xs in one step. However, the cost-to-go is only finite for states in SB(S) (here S already includes xs). The states in S should certainly be excluded because their optimal costs are
already known. These considerations reduce the set of candidate frontier states to (WB(xs) ∩ SB(S)) \ S. This set is still too large because the same action, u, must
produce a one-stage forward projection that includes xs and is a subset of S. The worst-case cost-to-go is computed for all x ∈ Front(xs, S) as
G(x) = \min_{u \in U_f(x)} \Big\{ \max_{\theta \in \Theta(x, u)} \big\{ l(x, u, \theta) + G(f(x, u, \theta)) \big\} \Big\},    (10.64)
Figure 10.9: The worst-case cost-to-go is computed for any state x such that there exists a u ∈ U(x) for which the one-stage forward projection is contained in the
updated S and one state in the forward projection is xs.
in which the restricted action set, Uf (x), is used. If x was already in A and a previous G(x) was computed, then the minimum of its previous value and (10.64) is kept.
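The nondeterministic Dijkstra iteration of Figure 10.8 can be sketched with a priority queue standing in for the active set A. To keep the sketch short, the frontier is found by scanning all states outside S rather than by intersecting backprojections, and all names (X, U, Theta, f, l, XG) are illustrative assumptions.

```python
import heapq, itertools, math

def nondeterministic_dijkstra(X, U, Theta, f, l, XG):
    """Sketch of Figure 10.8.  The active set A is a priority queue ordered
    by the candidate worst-case cost-to-go G."""
    G = {x: (0.0 if x in XG else math.inf) for x in X}
    pi = {}
    tie = itertools.count()                       # tie-breaker for the heap
    A = [(0.0, next(tie), x, 'u_T') for x in XG]  # Step 1
    heapq.heapify(A)
    S = set()
    while A:                                      # Step 2: smallest G first
        g, _, xs, u = heapq.heappop(A)
        if xs in S or g > G[xs]:
            continue                              # stale queue entry
        S.add(xs)                                 # Step 3: G*(xs) = G(xs)
        pi[xs] = u
        for x in set(X) - S:                      # Step 4: frontier update
            for u2 in U(x):
                succ = {f(x, u2, th) for th in Theta(x, u2)}
                if xs in succ and succ <= S:      # conditions on Front(xs, S)
                    worst = max(l(x, u2, th) + G[f(x, u2, th)]
                                for th in Theta(x, u2))
                    if worst < G[x]:              # keep the lower G value
                        G[x] = worst
                        heapq.heappush(A, (worst, next(tie), x, u2))
    return pi, G
```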
Probabilistic Dijkstra A probabilistic version of Dijkstra’s algorithm does not exist in general; however, for some problems, it can be made to work. The algorithm in Figure 10.8 is adapted to the probabilistic case by using
G(x) = \min_{u \in U_f(x)} \Big\{ E_{\theta}\big[ l(x, u, \theta) + G(f(x, u, \theta)) \big] \Big\}    (10.65)
in the place of (10.64). The definition of Front remains the same, and the nondeterministic forward projections are still applied to the probabilistic problem. Only edges in the transition graph that have nonzero probability are actually considered as possible future states. Edges with zero probability are precluded from the forward projection because they cannot affect the computed cost values.
The probabilistic version of Dijkstra’s algorithm can be successfully applied if there exists a plan, π, for which from any xk ∈ X there is probability one that
Gπ (xk+1) < Gπ (xk). What does this condition mean? From any xk, all possible next states that have nonzero probability of occurring must have a lower cost value. If all edge costs are positive, this means that all paths in the multi-stage forward projection will make monotonic progress toward the goal. In the deterministic case, this always occurs if l(x, u) is always positive. If nonmonotonic paths are possible, then Dijkstra’s algorithm breaks down because the region in which cost-to-go values change is no longer contained within a propagating band, which arises in Dijkstra’s algorithm for deterministic problems.