Infinite-Horizon Problems
In stochastic control theory and artificial intelligence research, most problems considered to date do not specify a goal set. Therefore, there are no associated termination actions. The task is to develop a plan that minimizes the expected cost (or maximizes the expected reward) over some number of stages. If the number of stages is finite, then it is straightforward to apply the value iteration method of Section 10.2.1. The adapted version of backward value iteration simply terminates when the first stage is reached. The problem becomes more challenging if the number of stages is infinite. This is called an infinite-horizon problem.
The number of stages for the planning problems considered in Section 10.1 is also infinite; however, it was expected that if the goal could be reached, termination would occur in a finite number of iterations. If there is no termination condition, then the costs tend to infinity. There are two alternative cost models that force the costs to become finite. The discounted cost model shrinks the per-stage costs as the stages extend into the future; this yields a geometric series for the total cost that converges to a finite value. The average cost-per-stage model divides the total cost by the number of stages. This essentially normalizes the accumulating cost, once again preventing its divergence to infinity. Some of the computation methods of Section 10.2 can be adapted to these models. This section formulates these two infinite-horizon cost models and presents computational solutions.
- Problem Formulation
Both of the cost models presented in this section were designed to force the cumulative cost to become finite, even though there is an infinite number of stages. Each can be considered as a minor adaptation of the cost functional used in Formulation 10.1.
The following formulation will be used throughout Section 10.3.
Formulation 10.2 (Infinite-Horizon Problems)
- A nonempty, finite state space X.
- For each state x ∈ X, a finite action space U(x) (there is no termination action, contrary to Formulation 10.1).
- A finite nature action space Θ(x, u) for each x ∈ X and u ∈ U(x).
- A state transition function f that produces a state, f(x, u, θ), for every
x ∈ X, u ∈ U(x), and θ ∈ Θ(x, u).
- A set of stages, each denoted by k, that begins at k = 1 and continues indefinitely.
- A stage-additive cost functional, L(x˜, u˜, θ˜), in which x˜, u˜, and θ˜ are infinite state, action, and nature histories, respectively. Two alternative forms of L will be given shortly.
In comparison to Formulation 10.1, note that here there is no initial or goal state. Therefore, there are no termination actions. Without the specification of a goal set, this may appear to be a strange form of planning. A feedback plan, π, still takes the same form; π(x) produces an action u ∈ U(x) for each x ∈ X.
As a possible application, imagine a robot that delivers materials in a factory from several possible source locations to several destinations. The robot operates over a long work shift and has a probabilistic model of when requests to deliver materials will arrive. Formulation 10.2 can be used to define a problem in which the goal is to minimize the average amount of time that materials wait to be delivered. This strategy should not depend on the length of the shift; therefore, an infinite number of stages is reasonable. If the shift is too short, the robot may focus only on one delivery, or it may not even have enough time to accomplish that.
Discounted cost In Formulation 10.2, the cost functional in Item 6 must be defined carefully to ensure that finite values are always obtained, even though the number of stages tends to infinity. The discounted cost model provides one simple way to achieve this by rapidly decreasing costs in future stages. Its definition is
based on the standard geometric series. For any real parameter α ∈ (0, 1),
\lim_{K \to \infty} \sum_{k=0}^{K} \alpha^k = \frac{1}{1 - \alpha}. \qquad (10.66)

The simplest case, α = 1/2, yields 1 + 1/2 + 1/4 + 1/8 + · · ·, which clearly converges to 2.
Now let α ∈ (0, 1) denote a discount factor, which is applied in the definition of a cost functional:

L(\tilde{x}, \tilde{u}, \tilde{\theta}) = \lim_{K \to \infty} \sum_{k=0}^{K} \alpha^k l(x_k, u_k, \theta_k). \qquad (10.67)
Let lk denote the cost, l(xk, uk, θk), received at stage k. For convenience in this setting, the first stage is k = 0, as opposed to k = 1, which has been used previously. As the maximum stage, K, increases, the diminished importance of costs far in the future can easily be observed, as indicated in Figure 10.10.
The rate of cost decrease depends strongly on α. For example, if α = 1/2, the costs decrease very rapidly. If α = 0.999, the convergence to zero is much slower. The trade-off is that with a large value of α, more stages are taken into account, and the designed plan is usually of higher quality. If a small value of α is used, methods such as value iteration converge much more quickly; however, the solution quality may be poor because of “short sightedness.”
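To make this trade-off concrete, the following small computation (an illustrative sketch only, assuming a constant per-stage cost of 1) reports how many stages are needed before the discounted sum reaches 99% of its limiting value 1/(1 − α) for a few values of α.

```python
# Illustrative only: partial sums of the discounted series with a constant
# per-stage cost of 1, showing how alpha controls the effective horizon.
for alpha in (0.5, 0.9, 0.999):
    limit = 1.0 / (1.0 - alpha)      # value of the infinite series (10.66)
    total, K = 1.0, 0                # the k = 0 term contributes 1
    while total < 0.99 * limit:      # count stages until 99% of the limit
        K += 1
        total += alpha ** K
    print(f"alpha = {alpha}: limit = {limit:.1f}, 99% reached after {K} stages")
```

With α = 1/2 only a handful of stages matter, whereas with α = 0.999 thousands of stages contribute noticeably to the total.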
Stage   L^*_K
K = 0   l_0
K = 1   l_0 + α l_1
K = 2   l_0 + α l_1 + α^2 l_2
K = 3   l_0 + α l_1 + α^2 l_2 + α^3 l_3
K = 4   l_0 + α l_1 + α^2 l_2 + α^3 l_3 + α^4 l_4
...

Figure 10.10: The cost magnitudes decrease exponentially over the stages.

The term l(x_k, u_k, θ_k) in (10.67) assumes different values depending on x_k, u_k,
and θ_k. Since there are only a finite number of possibilities, they must be bounded by some positive constant c.¹ Hence,

\lim_{K \to \infty} \sum_{k=0}^{K} \alpha^k l(x_k, u_k, \theta_k) \;\le\; \lim_{K \to \infty} \sum_{k=0}^{K} \alpha^k c \;\le\; \frac{c}{1 - \alpha}, \qquad (10.68)
which means that L(x˜, u˜, θ˜) is bounded from above, as desired. A similar lower bound can be constructed, which ensures that the resulting total cost is always finite.
Average cost-per-stage An alternative to discounted cost is to use the average cost-per-stage model, which keeps the cumulative cost finite by dividing out the total number of stages:
L(\tilde{x}, \tilde{u}, \tilde{\theta}) = \lim_{K \to \infty} \frac{1}{K} \sum_{k=0}^{K-1} l(x_k, u_k, \theta_k). \qquad (10.69)
Using the maximum per-stage cost bound c, it is clear that (10.69) grows no larger than c, even as K → ∞. This model is sometimes preferable because the cost does not depend on an arbitrary parameter, α.
- Solution Techniques
Straightforward adaptations of the value and policy iteration methods of Section
10.2 exist for infinite-horizon problems. These will be presented here; however, it is important to note that many other important issues exist regarding their convergence and numerical stability [96]. There are several other variants of these algorithms that are too involved to cover here but nevertheless are important because they address many of these additional issues. The main point in this section is to understand the simple relationship to the problems considered so far in Sections 10.1 and 10.2.
¹The state space X may even be infinite, but this requires that the set of possible costs is bounded.
Value iteration for discounted cost A backward value iteration solution will be presented that follows naturally from the method given in Section 10.2.1. For notational convenience, let the first stage be designated as k = 0 so that α^{k−1} may be replaced by α^k. In the probabilistic case, the expected optimal cost-to-go is
G^*(x) = \lim_{K \to \infty} \min_{\tilde{u}} \left\{ E_{\tilde{\theta}} \left[ \sum_{k=1}^{K} \alpha^k l(x_k, u_k, \theta_k) \right] \right\}. \qquad (10.70)
The expectation is taken over all nature histories, each of which is an infinite sequence of nature actions. The corresponding expression for the nondeterministic case is
G^*(x) = \lim_{K \to \infty} \min_{\tilde{u}} \max_{\tilde{\theta}} \left\{ \sum_{k=1}^{K} \alpha^k l(x_k, u_k, \theta_k) \right\}. \qquad (10.71)
Since the probabilistic case is more common, it will be covered here. The nondeterministic version is handled in a similar way (see Exercise 17). As before, backward value iterations will be performed because they are simpler to express. The discount factor causes a minor complication that must be fixed to make the dynamic programming recurrence work properly.
One difficulty is that the stage index now appears in the cost function, in the form of α^k. This means that the shift-invariant property from Section 2.3.1.1 is no longer preserved. We must therefore be careful about assigning stage indices. This is a problem because, in backward value iteration, the final stage index has so far been unknown and unimportant.
Consider a sequence of discounted decision-making problems, by increasing the maximum stage index: K = 0, K = 1, K = 2, . . .. Look at the neighboring cost expressions in Figure 10.10. What is the difference between finding the optimal cost-to-go for the K + 1-stage problem and the K-stage problem? In Figure 10.10 the last four terms of the cost for K = 4 can be obtained by multiplying all terms for K = 3 by α and adding a new term, l0. The only difference is that the stage indices need to be shifted by one on each li that was borrowed from the K = 3 case. In general, the optimal costs of a K-stage optimization problem can serve as the optimal costs of the K + 1-stage problem if they are first multiplied by α. The K + 1-stage optimization problem can be solved by optimizing over the sum of the first-stage cost plus the optimal cost for the K-stage problem, discounted by α.
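Concretely, comparing the K = 3 and K = 4 rows of Figure 10.10,

l_0 + \alpha l_1 + \alpha^2 l_2 + \alpha^3 l_3 + \alpha^4 l_4 \;=\; l_0 + \alpha \left( l_1 + \alpha l_2 + \alpha^2 l_3 + \alpha^3 l_4 \right),

in which the parenthesized sum has exactly the form of the K = 3 expression with each stage index shifted by one.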
This can be derived using straightforward dynamic programming arguments as follows. Suppose that K is fixed. The cost-to-go can be expressed recursively for k from 0 to K as
G^*_k(x_k) = \min_{u_k \in U(x_k)} \left\{ E_{\theta_k} \left[ \alpha^k l(x_k, u_k, \theta_k) + G^*_{k+1}(x_{k+1}) \right] \right\}, \qquad (10.72)

in which x_{k+1} = f(x_k, u_k, θ_k). The problem, however, is that the recursion depends on k through α^k, which makes it appear nonstationary.
The idea of using neighboring cost values as shown in Figure 10.10 can be applied by making a notational change. Let J^*_{K−k}(x_k) = α^{−k} G^*_k(x_k). This reverses the direction of the stage indices to avoid specifying the final stage and also scales by α^{−k} to correctly compensate for the index change. Substitution into (10.72) yields
\alpha^k J^*_{K-k}(x_k) = \min_{u_k \in U(x_k)} \left\{ E_{\theta_k} \left[ \alpha^k l(x_k, u_k, \theta_k) + \alpha^{k+1} J^*_{K-k-1}(x_{k+1}) \right] \right\}. \qquad (10.73)
Dividing by α^k and then letting i = K − k yields

J^*_i(x_k) = \min_{u_k \in U(x_k)} \left\{ E_{\theta_k} \left[ l(x_k, u_k, \theta_k) + \alpha J^*_{i-1}(x_{k+1}) \right] \right\}, \qquad (10.74)

in which J^*_i represents the expected cost for a finite-horizon discounted problem in which K = i. Note that (10.74) expresses J^*_i in terms of J^*_{i−1}, but x_k is given, and the right-hand side uses x_{k+1}. The indices appear to run in opposite directions because this is simply backward value iteration with a notational change that reverses some of the indices. The particular stage indices of x_k and x_{k+1} are not important in (10.74), as long as x_{k+1} = f(x_k, u_k, θ_k) (for example, the substitutions x = x_k, x′ = x_{k+1}, u = u_k, and θ = θ_k can be safely made).
Value iteration proceeds by first letting J^*_0(x_0) = 0 for all x ∈ X. Successive cost-to-go functions are computed by iterating (10.74) over the state space. In contrast to the value iterations of Section 10.2.1, which terminate after a finite number of stages under the cycle-avoiding assumptions, the convergence here is usually asymptotic due to the infinite horizon. The discounting gradually causes the cost differences to diminish until they are within the desired tolerance. The stationary form of the dynamic programming recurrence, which is obtained in the limit as i tends to infinity, is
J^*(x) = \min_{u \in U(x)} \left\{ E_{\theta} \left[ l(x, u, \theta) + \alpha J^*(f(x, u, \theta)) \right] \right\}. \qquad (10.75)
If the cost terms do not depend on nature, then the simplified form is
J^*(x) = \min_{u \in U(x)} \left\{ l(x, u) + \alpha \sum_{x' \in X} J^*(x') P(x' | x, u) \right\}. \qquad (10.76)
As explained in Section 10.2.1, the optimal action, π^*(x), is assigned as the u ∈ U(x) that satisfies (10.75) or (10.76) at x.
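To illustrate how (10.74)–(10.76) translate into an algorithm, the following sketch implements value iteration for the simplified recurrence (10.76). The problem interface (states, actions(x), cost(x, u), and trans_prob(x, u), which returns a dictionary of next-state probabilities) is a hypothetical placeholder, not part of the formulation above.

```python
# A sketch of discounted-cost value iteration based on (10.76). The problem
# interface (states, actions, cost, trans_prob) is hypothetical.
def value_iteration(states, actions, cost, trans_prob, alpha, tol=1e-6):
    J = {x: 0.0 for x in states}                       # J*_0(x) = 0 for all x
    while True:
        J_new = {}
        for x in states:
            # Right-hand side of (10.76): l(x, u) + alpha * sum_x' J(x') P(x' | x, u).
            J_new[x] = min(
                cost(x, u)
                + alpha * sum(p * J[xp] for xp, p in trans_prob(x, u).items())
                for u in actions(x)
            )
        done = max(abs(J_new[x] - J[x]) for x in states) < tol
        J = J_new
        if done:
            break
    # pi*(x) is an action achieving the minimum in (10.76) at x.
    pi = {
        x: min(
            actions(x),
            key=lambda u: cost(x, u)
            + alpha * sum(p * J[xp] for xp, p in trans_prob(x, u).items()),
        )
        for x in states
    }
    return J, pi
```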
Policy iteration for discounted cost The policy iteration method may al- ternatively be applied to the probabilistic discounted-cost problem. Recall the method given in Figure 10.4. The general approach remains the same: A search is conducted over the space of plans by solving a linear system of equations in each iteration. In Step 2, (10.53) is replaced by
J_{\pi}(x) = l(x, u) + \alpha \sum_{x' \in X} J_{\pi}(x') P(x' | x, u), \qquad (10.77)
which is a special form of (10.76) for evaluating a fixed plan. In Step 3, (10.54) is replaced by
\pi'(x) = \mathop{\mathrm{argmin}}_{u \in U(x)} \left\{ l(x, u) + \alpha \sum_{x' \in X} J_{\pi}(x') P(x' | x, u) \right\}. \qquad (10.78)
Using these alterations, the policy iteration algorithm proceeds in the same way as in Section 10.2.2.
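As a rough illustration of these alterations, the sketch below performs policy iteration with (10.77) solved as a linear system and (10.78) used for plan improvement. The state indexing and the helper functions (actions(x), cost(x, u), P(x, u)) are assumptions made only for this example.

```python
import numpy as np

# A sketch of policy iteration for the discounted-cost model. States are assumed
# to be indexed 0..n-1; actions(x) returns a list, cost(x, u) a scalar, and
# P(x, u) a length-n vector of next-state probabilities. All are hypothetical.
def policy_iteration(n, actions, cost, P, alpha):
    pi = {x: actions(x)[0] for x in range(n)}          # arbitrary initial plan
    while True:
        # Step 2: evaluate pi via the linear system (10.77),
        # (I - alpha * P_pi) J_pi = l_pi.
        P_pi = np.array([P(x, pi[x]) for x in range(n)])
        l_pi = np.array([cost(x, pi[x]) for x in range(n)])
        J = np.linalg.solve(np.eye(n) - alpha * P_pi, l_pi)
        # Step 3: improve the plan using (10.78).
        pi_new = {
            x: min(actions(x), key=lambda u: cost(x, u) + alpha * np.dot(P(x, u), J))
            for x in range(n)
        }
        if pi_new == pi:                               # unchanged plan: optimal
            return J, pi
        pi = pi_new
```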
Solutions for the average cost-per-stage model A value iteration algorithm for the average cost model can be obtained by simply neglecting to divide by K. Selecting actions that optimize the total cost also optimizes the average cost as the number of stages approaches infinity. This may cause costs to increase toward ±∞; however, only a finite number of iterations can be executed in practice.
The backward value iterations of Section 10.2.1 can be followed with very little modification. Initially, let G^*(x_F) = 0 for all x_F ∈ X. The value iterations are computed using the standard form
G^*_k(x_k) = \min_{u_k \in U(x_k)} \left\{ \sum_{\theta_k \in \Theta(x_k, u_k)} \Big( l(x_k, u_k, \theta_k) + G^*_{k+1}(f(x_k, u_k, \theta_k)) \Big) P(\theta_k | x_k, u_k) \right\}. \qquad (10.79)
The iterations continue until convergence occurs. To determine whether a solution of sufficient quality has been obtained, a reasonable criterion is
\max_{x \in X} \left\{ \left| G^*_k(x)/N - G^*_{k+1}(x)/(N-1) \right| \right\} < \epsilon, \qquad (10.80)

in which ε is the error tolerance and N is the number of value iterations that have been completed (it is required in (10.80) that N > 1). Once (10.80) has been satisfied, the iterations can be terminated.
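The following sketch shows one way the average-cost value iterations and the stopping criterion (10.80) might be organized in code; the per-stage cost is assumed to be independent of nature, as in (10.76), and the problem interface (states, actions, cost, trans_prob) is again a hypothetical placeholder.

```python
# A sketch of value iteration for the average cost-per-stage model with the
# stopping criterion (10.80). The per-stage cost is assumed independent of
# nature, as in (10.76); states, actions, cost, trans_prob are hypothetical.
def average_cost_value_iteration(states, actions, cost, trans_prob, eps=1e-4):
    G_prev = {x: 0.0 for x in states}                  # G*(x_F) = 0 initially
    N = 0
    while True:
        N += 1
        G = {
            x: min(
                cost(x, u)
                + sum(p * G_prev[xp] for xp, p in trans_prob(x, u).items())
                for u in actions(x)
            )
            for x in states
        }
        # Criterion (10.80): compare the cost per stage between iterations.
        if N > 1 and max(abs(G[x] / N - G_prev[x] / (N - 1)) for x in states) < eps:
            return {x: G[x] / N for x in states}       # approximate average cost
        G_prev = G
```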
A numerical problem may exist with the growing values obtained for G^*(x). This can be alleviated by periodically reducing all values by some constant factor to ensure that the numbers fit within the allowable floating-point range. In [96], a method called relative value iteration is presented, which selects one state, s ∈ X, arbitrarily and expresses the cost-to-go values by subtracting off the cost at s. This trims down all values simultaneously to keep them bounded while still maintaining the convergence properties of the algorithm.
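A minimal sketch of the rescaling idea just described (not the full treatment from [96]) is shown below. The backup function, which stands in for one sweep of (10.79), and the reference state s are assumptions of this example.

```python
# A sketch of the rescaling idea behind relative value iteration: subtract the
# value at an arbitrary reference state s after each sweep so that the stored
# numbers remain bounded. `backup` stands in for one sweep of (10.79) and is a
# hypothetical placeholder, as is the reference state s.
def relative_value_iteration(states, backup, s, eps=1e-6, max_iters=100000):
    h = {x: 0.0 for x in states}
    for _ in range(max_iters):
        G = backup(h)                       # one value-iteration sweep over X
        offset = G[s]                       # cost-to-go at the reference state
        h_new = {x: G[x] - offset for x in states}
        if max(abs(h_new[x] - h[x]) for x in states) < eps:
            return h_new, offset
        h = h_new
    return h, None                          # did not converge within max_iters
```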
Policy iteration can alternatively be performed by using the method given in Figure 10.4 with only minor modification.