Infinite-Horizon Problems
In stochastic control theory and artificial intelligence research, most problems considered to date do not specify a goal set. Therefore, there are no associated termination actions. The task is to develop a plan that minimizes the expected cost (or maximizes the expected reward) over some number of stages. If the number of stages is finite, then it is straightforward to apply the value iteration method of Section 10.2.1. The adapted version of backward value iteration simply terminates when the first stage is reached. The problem becomes more challenging if the number of stages is infinite. This is called an infinite-horizon problem.
The number of stages for the planning problems considered in Section 10.1 is also infinite; however, it was expected that if the goal could be reached, termination would occur in a finite number of iterations. If there is no termination condition, then the costs tend to infinity. There are two alternative cost models that force the costs to become finite. The discounted cost model shrinks the per-stage costs as the stages extend into the future; this yields a geometric series for the total cost that converges to a finite value. The average cost-per-stage model divides the total cost by the number of stages. This essentially normalizes the accumulating cost, once again preventing its divergence to infinity. Some of the computation methods of Section 10.2 can be adapted to these models. This section formulates these two infinite-horizon cost models and presents computational solutions.
- Problem Formulation
Both of the cost models presented in this section were designed to force the cumulative cost to become finite, even though there is an infinite number of stages. Each can be considered as a minor adaptation of the cost functional used in Formulation 10.1.
The following formulation will be used throughout Section 10.3.
Formulation 10.2 (Infinite-Horizon Problems)
- A nonempty, finite state space X.
- For each state x ∈ X, a finite action space U(x) (there is no termination action, contrary to Formulation 10.1).
- A finite nature action space Θ(x, u) for each x ∈ X and u ∈ U(x).
- A state transition function f that produces a state, f(x, u, θ), for every
x ∈ X, u ∈ U(x), and θ ∈ Θ(x, u).
- A set of stages, each denoted by k, that begins at k = 1 and continues indefinitely.
- A stage-additive cost functional, L(x˜, u˜, θ˜), in which x˜, u˜, and θ˜ are infinite state, action, and nature histories, respectively. Two alternative forms of L will be given shortly.
In comparison to Formulation 10.1, note that here there is no initial or goal state. Therefore, there are no termination actions. Without the specification of a goal set, this may appear to be a strange form of planning. A feedback plan, π, still takes the same form; π(x) produces an action u ∈ U(x) for each x ∈ X.
As a possible application, imagine a robot that delivers materials in a factory from several possible source locations to several destinations. The robot operates over a long work shift and has a probabilistic model of when requests to deliver materials will arrive. Formulation 10.2 can be used to define a problem in which the goal is to minimize the average amount of time that materials wait to be delivered. This strategy should not depend on the length of the shift; therefore, an infinite number of stages is reasonable. If the shift is too short, the robot may focus only on one delivery, or it may not even have enough time to accomplish that.
Discounted cost In Formulation 10.2, the cost functional in Item 6 must be defined carefully to ensure that finite values are always obtained, even though the number of stages tends to infinity. The discounted cost model provides one simple way to achieve this by rapidly decreasing costs in future stages. Its definition is
based on the standard geometric series. For any real parameter α ∈ (0, 1),
\lim_{K \to \infty} \sum_{k=0}^{K} \alpha^k = \frac{1}{1 - \alpha}. \qquad (10.66)

The simplest case, α = 1/2, yields 1 + 1/2 + 1/4 + 1/8 + · · ·, which clearly converges to 2.
Now let α ∈ (0, 1) denote a discount factor, which is applied in the definition of a cost functional:

L(\tilde{x}, \tilde{u}, \tilde{\theta}) = \lim_{K \to \infty} \sum_{k=0}^{K} \alpha^k l(x_k, u_k, \theta_k). \qquad (10.67)
Let lk denote the cost, l(xk, uk, θk), received at stage k. For convenience in this setting, the first stage is k = 0, as opposed to k = 1, which has been used previously. As the maximum stage, K, increases, the diminished importance of costs far in the future can easily be observed, as indicated in Figure 10.10.
The rate of cost decrease depends strongly on α. For example, if α = 1/2, the costs decrease very rapidly. If α = 0.999, the convergence to zero is much slower. The trade-off is that with a large value of α, more stages are taken into account, and the designed plan is usually of higher quality. If a small value of α is used, methods such as value iteration converge much more quickly; however, the solution quality may be poor because of “short sightedness.”
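To make this trade-off concrete, the following small computation (an illustrative sketch only, assuming a constant per-stage cost of 1) reports how many stages are needed before the discounted sum reaches 99% of its limiting value 1/(1 − α) for a few values of α.

```python
# Illustrative only: partial sums of the discounted series with a constant
# per-stage cost of 1, showing how alpha controls the effective horizon.
for alpha in (0.5, 0.9, 0.999):
    limit = 1.0 / (1.0 - alpha)      # value of the infinite series (10.66)
    total, K = 1.0, 0                # the k = 0 term contributes 1
    while total < 0.99 * limit:      # count stages until 99% of the limit
        K += 1
        total += alpha ** K
    print(f"alpha = {alpha}: limit = {limit:.1f}, 99% reached after {K} stages")
```

With α = 1/2 only a handful of stages matter, whereas with α = 0.999 thousands of stages contribute noticeably to the total.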
Stage   L^*_K
K = 0   l_0
K = 1   l_0 + α l_1
K = 2   l_0 + α l_1 + α^2 l_2
K = 3   l_0 + α l_1 + α^2 l_2 + α^3 l_3
K = 4   l_0 + α l_1 + α^2 l_2 + α^3 l_3 + α^4 l_4
...

Figure 10.10: The cost magnitudes decrease exponentially over the stages.

The term l(x_k, u_k, θ_k) in (10.67) assumes different values depending on x_k, u_k,
and θ_k. Since there are only a finite number of possibilities, they must be bounded by some positive constant c.¹ Hence,

\lim_{K \to \infty} \sum_{k=0}^{K} \alpha^k l(x_k, u_k, \theta_k) \;\le\; \lim_{K \to \infty} \sum_{k=0}^{K} \alpha^k c \;\le\; \frac{c}{1 - \alpha}, \qquad (10.68)
which means that L(x˜, u˜, θ˜) is bounded from above, as desired. A similar lower bound can be constructed, which ensures that the resulting total cost is always finite.
Average cost-per-stage An alternative to discounted cost is to use the average cost-per-stage model, which keeps the cumulative cost finite by dividing out the total number of stages:
L(\tilde{x}, \tilde{u}, \tilde{\theta}) = \lim_{K \to \infty} \frac{1}{K} \sum_{k=0}^{K-1} l(x_k, u_k, \theta_k). \qquad (10.69)
Using the maximum per-stage cost bound c, it is clear that (10.69) grows no larger than c, even as K → ∞. This model is sometimes preferable because the cost does not depend on an arbitrary parameter, α.
- Solution Techniques
Straightforward adaptations of the value and policy iteration methods of Section
10.2 exist for infinite-horizon problems. These will be presented here; however, it is important to note that many other important issues exist regarding their convergence and numerical stability [96]. There are several other variants of these algorithms that are too involved to cover here but nevertheless are important because they address many of these additional issues. The main point in this section is to understand the simple relationship to the problems considered so far in Sections 10.1 and 10.2.
¹The state space X may even be infinite, but this requires that the set of possible costs is bounded.
Value iteration for discounted cost A backward value iteration solution will be presented that follows naturally from the method given in Section 10.2.1. For notational convenience, let the first stage be designated as k = 0 so that α^{k−1} may be replaced by α^k. In the probabilistic case, the expected optimal cost-to-go is
G^*(x) = \lim_{K \to \infty} \min_{\tilde{u}} \left\{ E_{\tilde{\theta}} \left[ \sum_{k=1}^{K} \alpha^k l(x_k, u_k, \theta_k) \right] \right\}. \qquad (10.70)
The expectation is taken over all nature histories, each of which is an infinite sequence of nature actions. The corresponding expression for the nondeterministic case is
G^*(x) = \lim_{K \to \infty} \min_{\tilde{u}} \max_{\tilde{\theta}} \left\{ \sum_{k=1}^{K} \alpha^k l(x_k, u_k, \theta_k) \right\}. \qquad (10.71)
Since the probabilistic case is more common, it will be covered here. The nondeterministic version is handled in a similar way (see Exercise 17). As before, backward value iterations will be performed because they are simpler to express. The discount factor causes a minor complication that must be fixed to make the dynamic programming recurrence work properly.
One difficulty is that the stage index now appears in the cost function, in the form of α^k. This means that the shift-invariant property from Section 2.3.1.1 is no longer preserved. We must therefore be careful about assigning stage indices. This is a problem because, in backward value iteration, the final stage index has so far been unknown and unimportant.
Consider a sequence of discounted decision-making problems, by increasing the maximum stage index: K = 0, K = 1, K = 2, . . .. Look at the neighboring cost expressions in Figure 10.10. What is the difference between finding the optimal cost-to-go for the K + 1-stage problem and the K-stage problem? In Figure 10.10 the last four terms of the cost for K = 4 can be obtained by multiplying all terms for K = 3 by α and adding a new term, l0. The only difference is that the stage indices need to be shifted by one on each li that was borrowed from the K = 3 case. In general, the optimal costs of a K-stage optimization problem can serve as the optimal costs of the K + 1-stage problem if they are first multiplied by α. The K + 1-stage optimization problem can be solved by optimizing over the sum of the first-stage cost plus the optimal cost for the K-stage problem, discounted by α.
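Concretely, comparing the K = 3 and K = 4 rows of Figure 10.10,

l_0 + \alpha l_1 + \alpha^2 l_2 + \alpha^3 l_3 + \alpha^4 l_4 \;=\; l_0 + \alpha \left( l_1 + \alpha l_2 + \alpha^2 l_3 + \alpha^3 l_4 \right),

in which the parenthesized sum has exactly the form of the K = 3 expression with each stage index shifted by one.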
This can be derived using straightforward dynamic programming arguments as follows. Suppose that K is fixed. The cost-to-go can be expressed recursively for k from 0 to K as
G^*_k(x_k) = \min_{u_k \in U(x_k)} \left\{ E_{\theta_k} \left[ \alpha^k l(x_k, u_k, \theta_k) + G^*_{k+1}(x_{k+1}) \right] \right\}, \qquad (10.72)

in which x_{k+1} = f(x_k, u_k, θ_k). The problem, however, is that the recursion depends on k through α^k, which makes it appear nonstationary.
The idea of using neighboring cost values as shown in Figure 10.10 can be applied by making a notational change. Let J^*_{K−k}(x_k) = α^{−k} G^*_k(x_k). This reverses the direction of the stage indices to avoid specifying the final stage and also scales by α^{−k} to correctly compensate for the index change. Substitution into (10.72) yields
\alpha^k J^*_{K-k}(x_k) = \min_{u_k \in U(x_k)} \left\{ E_{\theta_k} \left[ \alpha^k l(x_k, u_k, \theta_k) + \alpha^{k+1} J^*_{K-k-1}(x_{k+1}) \right] \right\}. \qquad (10.73)
Dividing by α^k and then letting i = K − k yields

J^*_i(x_k) = \min_{u_k \in U(x_k)} \left\{ E_{\theta_k} \left[ l(x_k, u_k, \theta_k) + \alpha J^*_{i-1}(x_{k+1}) \right] \right\}, \qquad (10.74)

in which J^*_i represents the expected cost for a finite-horizon discounted problem in which K = i. Note that (10.74) expresses J^*_i in terms of J^*_{i−1}, but x_k is given, and the right-hand side uses x_{k+1}. The indices appear to run in opposite directions because this is simply backward value iteration with a notational change that reverses some of the indices. The particular stage indices of x_k and x_{k+1} are not important in (10.74), as long as x_{k+1} = f(x_k, u_k, θ_k) (for example, the substitutions x = x_k, x′ = x_{k+1}, u = u_k, and θ = θ_k can be safely made).
Value iteration proceeds by first letting J^*_0(x_0) = 0 for all x ∈ X. Successive cost-to-go functions are computed by iterating (10.74) over the state space. In contrast to the value iterations of Section 10.2.1, which terminate after a finite number of stages under the cycle-avoiding assumptions, the convergence here is usually asymptotic due to the infinite horizon. The discounting gradually causes the cost differences to diminish until they are within the desired tolerance. The stationary form of the dynamic programming recurrence, which is obtained in the limit as i tends to infinity, is
J^*(x) = \min_{u \in U(x)} \left\{ E_{\theta} \left[ l(x, u, \theta) + \alpha J^*(f(x, u, \theta)) \right] \right\}. \qquad (10.75)
If the cost terms do not depend on nature, then the simplified form is
J^*(x) = \min_{u \in U(x)} \left\{ l(x, u) + \alpha \sum_{x' \in X} J^*(x') P(x' | x, u) \right\}. \qquad (10.76)
As explained in Section 10.2.1, the optimal action, π^*(x), is assigned as the u ∈ U(x) that satisfies (10.75) or (10.76) at x.
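To illustrate how (10.74)–(10.76) translate into an algorithm, the following sketch implements value iteration for the simplified recurrence (10.76). The problem interface (states, actions(x), cost(x, u), and trans_prob(x, u), which returns a dictionary of next-state probabilities) is a hypothetical placeholder, not part of the formulation above.

```python
# A sketch of discounted-cost value iteration based on (10.76). The problem
# interface (states, actions, cost, trans_prob) is hypothetical.
def value_iteration(states, actions, cost, trans_prob, alpha, tol=1e-6):
    J = {x: 0.0 for x in states}                       # J*_0(x) = 0 for all x
    while True:
        J_new = {}
        for x in states:
            # Right-hand side of (10.76): l(x, u) + alpha * sum_x' J(x') P(x' | x, u).
            J_new[x] = min(
                cost(x, u)
                + alpha * sum(p * J[xp] for xp, p in trans_prob(x, u).items())
                for u in actions(x)
            )
        done = max(abs(J_new[x] - J[x]) for x in states) < tol
        J = J_new
        if done:
            break
    # pi*(x) is an action achieving the minimum in (10.76) at x.
    pi = {
        x: min(
            actions(x),
            key=lambda u: cost(x, u)
            + alpha * sum(p * J[xp] for xp, p in trans_prob(x, u).items()),
        )
        for x in states
    }
    return J, pi
```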
Policy iteration for discounted cost The policy iteration method may al- ternatively be applied to the probabilistic discounted-cost problem. Recall the method given in Figure 10.4. The general approach remains the same: A search is conducted over the space of plans by solving a linear system of equations in each iteration. In Step 2, (10.53) is replaced by
J_{\pi}(x) = l(x, u) + \alpha \sum_{x' \in X} J_{\pi}(x') P(x' | x, u), \qquad (10.77)
which is a special form of (10.76) for evaluating a fixed plan. In Step 3, (10.54) is replaced by
\pi'(x) = \mathop{\mathrm{argmin}}_{u \in U(x)} \left\{ l(x, u) + \alpha \sum_{x' \in X} J_{\pi}(x') P(x' | x, u) \right\}. \qquad (10.78)
Using these alterations, the policy iteration algorithm proceeds in the same way as in Section 10.2.2.
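As a rough illustration of these alterations, the sketch below performs policy iteration with (10.77) solved as a linear system and (10.78) used for plan improvement. The state indexing and the helper functions (actions(x), cost(x, u), P(x, u)) are assumptions made only for this example.

```python
import numpy as np

# A sketch of policy iteration for the discounted-cost model. States are assumed
# to be indexed 0..n-1; actions(x) returns a list, cost(x, u) a scalar, and
# P(x, u) a length-n vector of next-state probabilities. All are hypothetical.
def policy_iteration(n, actions, cost, P, alpha):
    pi = {x: actions(x)[0] for x in range(n)}          # arbitrary initial plan
    while True:
        # Step 2: evaluate pi via the linear system (10.77),
        # (I - alpha * P_pi) J_pi = l_pi.
        P_pi = np.array([P(x, pi[x]) for x in range(n)])
        l_pi = np.array([cost(x, pi[x]) for x in range(n)])
        J = np.linalg.solve(np.eye(n) - alpha * P_pi, l_pi)
        # Step 3: improve the plan using (10.78).
        pi_new = {
            x: min(actions(x), key=lambda u: cost(x, u) + alpha * np.dot(P(x, u), J))
            for x in range(n)
        }
        if pi_new == pi:                               # unchanged plan: optimal
            return J, pi
        pi = pi_new
```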
Solutions for the average cost-per-stage model A value iteration algorithm for the average cost model can be obtained by simply neglecting to divide by K. Selecting actions that optimize the total cost also optimizes the average cost as the number of stages approaches infinity. This may cause costs to increase toward ±∞; however, only a finite number of iterations can be executed in practice.
The backward value iterations of Section 10.2.1 can be followed with very little modification. Initially, let G^*(x_F) = 0 for all x_F ∈ X. The value iterations are computed using the standard form
G^*_k(x_k) = \min_{u_k \in U(x_k)} \left\{ \sum_{\theta_k \in \Theta(x_k, u_k)} \Big( l(x_k, u_k, \theta_k) + G^*_{k+1}(f(x_k, u_k, \theta_k)) \Big) P(\theta_k | x_k, u_k) \right\}. \qquad (10.79)
The iterations continue until convergence occurs. To determine whether a solution of sufficient quality has been obtained, a reasonable criterion is
\max_{x \in X} \left\{ \left| G^*_k(x)/N - G^*_{k+1}(x)/(N-1) \right| \right\} < \epsilon, \qquad (10.80)

in which ε is the error tolerance and N is the number of value iterations that have been completed (it is required in (10.80) that N > 1). Once (10.80) has been satisfied, the iterations can be terminated.
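The following sketch shows one way the average-cost value iterations and the stopping criterion (10.80) might be organized in code; the per-stage cost is assumed to be independent of nature, as in (10.76), and the problem interface (states, actions, cost, trans_prob) is again a hypothetical placeholder.

```python
# A sketch of value iteration for the average cost-per-stage model with the
# stopping criterion (10.80). The per-stage cost is assumed independent of
# nature, as in (10.76); states, actions, cost, trans_prob are hypothetical.
def average_cost_value_iteration(states, actions, cost, trans_prob, eps=1e-4):
    G_prev = {x: 0.0 for x in states}                  # G*(x_F) = 0 initially
    N = 0
    while True:
        N += 1
        G = {
            x: min(
                cost(x, u)
                + sum(p * G_prev[xp] for xp, p in trans_prob(x, u).items())
                for u in actions(x)
            )
            for x in states
        }
        # Criterion (10.80): compare the cost per stage between iterations.
        if N > 1 and max(abs(G[x] / N - G_prev[x] / (N - 1)) for x in states) < eps:
            return {x: G[x] / N for x in states}       # approximate average cost
        G_prev = G
```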
A numerical problem may exist with the growing values obtained for G^*(x). This can be alleviated by periodically reducing all values by some constant factor to ensure that the numbers fit within the allowable floating-point range. In [96], a method called relative value iteration is presented, which selects one state, s ∈ X, arbitrarily and expresses the cost-to-go values by subtracting off the cost at s. This trims down all values simultaneously to keep them bounded while still maintaining the convergence properties of the algorithm.
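A minimal sketch of the rescaling idea just described (not the full treatment from [96]) is shown below. The backup function, which stands in for one sweep of (10.79), and the reference state s are assumptions of this example.

```python
# A sketch of the rescaling idea behind relative value iteration: subtract the
# value at an arbitrary reference state s after each sweep so that the stored
# numbers remain bounded. `backup` stands in for one sweep of (10.79) and is a
# hypothetical placeholder, as is the reference state s.
def relative_value_iteration(states, backup, s, eps=1e-6, max_iters=100000):
    h = {x: 0.0 for x in states}
    for _ in range(max_iters):
        G = backup(h)                       # one value-iteration sweep over X
        offset = G[s]                       # cost-to-go at the reference state
        h_new = {x: G[x] - offset for x in states}
        if max(abs(h_new[x] - h[x]) for x in states) < eps:
            return h_new, offset
        h = h_new
    return h, None                          # did not converge within max_iters
```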
Policy iteration can alternatively be performed by using the method given in Figure 10.4 with only minor modification.