15.2 Continuous-Time Dynamic Programming

Dynamic programming has been a recurring theme throughout most of this book. So far, it has always taken the form of computing optimal cost-to-go (or cost-to-come) functions over some sequence of stages. Both value iteration and Dijkstra-like algorithms have emerged. In computer science, dynamic programming is a fundamental insight in the development of algorithms that compute optimal solutions to problems. In its original form, however, dynamic programming was developed to solve the optimal control problem [84]. In this setting, a discrete set of stages is replaced by a continuum of stages, known as time. The dynamic programming recurrence is instead a partial differential equation, called the Hamilton-Jacobi-Bellman (HJB) equation. The HJB equation can be solved using numerical algorithms; however, in some cases, it can be solved analytically.3 Section 15.2.2 briefly describes an analytical solution in the case of linear systems. Section 15.2.3 covers Pontryagin’s minimum principle, which can be derived from the dynamic programming principle, and generalizes the optimization performed in Hamiltonian mechanics (recall Section 13.4.4).

15.2.1 Hamilton-Jacobi-Bellman Equation

The HJB equation is a central result in optimal control theory. Many other principles and design techniques follow from the HJB equation, which itself is just a statement of the dynamic programming principle in continuous time. A proper derivation of all forms of the HJB equation would be beyond the scope of this book. Instead, a time-invariant formulation that is most relevant to planning will be given here. Also, an informal derivation will follow, based in part on [95].

3It is often surprising to computer scientists that dynamic programming in this case does not yield an algorithm. It instead yields a closed-form solution to the problem.

The discrete case

Before entering the continuous realm, the concepts will first be described for discrete planning, which is often easier to understand. Recall from Section 2.3 that if X, U, and the stages are discrete, then optimal planning can be performed by using value iteration or Dijkstra’s algorithm on the search graph. The stationary, optimal cost-to-go function G∗ can be used as a navigation function that encodes the optimal feedback plan. This was suggested in Section 8.2.2, and an example was shown in Figure 8.3.

Suppose that G∗ has been computed under Formulation 8.1 (or Formulation 2.3). Let the state transition equation be denoted as

x′ = fd(x, u). (15.5)

The dynamic programming recurrence for G∗ is

G∗(x) = min_{u∈U(x)} { l(x, u) + G∗(x′) } ,  (15.6)

which may already be considered as a discrete form of the Hamilton-Jacobi-Bellman equation. To gain some insights into the coming concepts, however, some further manipulations will be performed.

Let u∗ denote the optimal action that is applied in the min of (15.6). Imagine that u∗ is hypothesized as the optimal action but needs to be tested in (15.6) to make sure. If it is truly optimal, then

G∗(x) = l(x, u∗) + G∗(fd(x, u∗)). (15.7)

This can already be considered as a discrete form of the Pontryagin minimum principle, which will appear in Section 15.2.3. By rearranging terms, a nice inter- pretation is obtained:

G∗(fd(x, u∗)) − G∗(x) = −l(x, u∗). (15.8)

In a single stage, the optimal cost-to-go drops by l(x, u∗) when G∗ is used as a navigation function (multiply (15.8) by −1). The optimal single-stage cost is revealed precisely when taking one step toward the goal along the optimal path. This incremental change in the cost-to-go function while moving in the best direction forms the basis of both the HJB equation and the minimum principle.
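To make the recurrence (15.6) and the single-stage drop (15.8) concrete, the following short Python sketch runs value iteration on a tiny made-up graph in which each action is identified with the next state (so fd(x, u) = u); the states, costs, and goal here are illustrative choices only, not part of Formulation 8.1.

```python
import math

# Minimal sketch of value iteration for the recurrence (15.6) on a toy graph.
# Each action is identified with the next state, so fd(x, u) = u; the states,
# costs, and goal are made up for illustration.
X = ["a", "b", "c", "goal"]
U = {"a": ["b", "c"], "b": ["goal"], "c": ["goal"], "goal": []}
l = {("a", "b"): 1.0, ("a", "c"): 4.0, ("b", "goal"): 5.0, ("c", "goal"): 1.0}

G = {x: (0.0 if x == "goal" else math.inf) for x in X}
for _ in range(len(X)):                                   # enough sweeps to converge
    for x in X:
        if U[x]:
            G[x] = min(l[(x, u)] + G[u] for u in U[x])    # recurrence (15.6)

# Check the single-stage drop (15.8): G*(fd(x, u*)) - G*(x) = -l(x, u*).
x = "a"
u_star = min(U[x], key=lambda u: l[(x, u)] + G[u])
assert abs((G[u_star] - G[x]) + l[(x, u_star)]) < 1e-9
print(G)   # {'a': 5.0, 'b': 5.0, 'c': 1.0, 'goal': 0.0}
```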

The continuous case

Now consider adapting to the continuous case. Suppose X and U are both continuous, but discrete stages remain, and verify that (15.5) to (15.8) still hold true. Their present form can be used for any system that is approximated by discrete stages. Suppose that the discrete-time model of Section 14.2.2 is used to approximate a system x˙ = f(x, u) on a state space X that is a smooth manifold. In that model, U was discretized to Ud, but here it will be left in its original form. Let ∆t represent the time discretization.

The HJB equation will be obtained by approximating (15.6) with the discrete-time model and letting ∆t approach zero. The arguments here are very informal; see [95, 570, 912] for more details. Using the discrete-time approximation, the dynamic programming recurrence is

G∗(x) = min_{u∈U(x)} { ld(x, u) + G∗(x′) } ,  (15.9)

in which ld is a discrete-time approximation to the cost that accumulates over stage k and is given as

ld(x, u) ≈ l(x, u)∆t. (15.10)

It is assumed that as ∆t approaches zero, the total discretized cost converges to the integrated cost of the continuous-time formulation.

Using the linear part of a Taylor series expansion about x, the term G∗(x′) can be approximated as

G∗(x′) ≈ G∗(x) + Σ_{i=1}^{n} (∂G∗/∂xi) fi(x, u) ∆t.  (15.11)

This approximates G∗(x′) by its tangent plane at x. Substitution of (15.11) and (15.10) into (15.9) yields

G∗(x) ≈ min_{u∈U(x)} { l(x, u) ∆t + G∗(x) + Σ_{i=1}^{n} (∂G∗/∂xi) fi(x, u) ∆t } .  (15.12)

Subtracting G∗(x) from both sides of (15.12) yields

min_{u∈U(x)} { l(x, u) ∆t + Σ_{i=1}^{n} (∂G∗/∂xi) fi(x, u) ∆t } ≈ 0.  (15.13)

Dividing by ∆t and then taking the limit as ∆t approaches zero yields the HJB equation:

min_{u∈U(x)} { l(x, u) + Σ_{i=1}^{n} (∂G∗/∂xi) fi(x, u) } = 0.  (15.14)

Compare the HJB equation (15.14) to (15.6) for the discrete-time case. Both indicate how the cost changes when moving in the best direction. Substituting the optimal action u∗ into (15.14) yields

Σ_{i=1}^{n} (∂G∗/∂xi) fi(x, u∗) = −l(x, u∗).  (15.15)

This is just the continuous-time version of (15.8). In the current setting, the left side indicates the derivative of the cost-to-go function along the direction obtained by applying the optimal action from x.

The HJB equation, together with a boundary condition that specifies the final-stage cost, sufficiently characterizes the optimal solution to the planning problem. Since it is expressed over the whole state space, solutions to the HJB equation yield optimal feedback plans. Unfortunately, the HJB equation cannot be solved analytically in most settings. Therefore, numerical techniques, such as the value iteration method of Section 14.5, must be employed. There is, however, an important class of problems that can be directly solved using the HJB equation; see Section 15.2.2.
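As a rough illustration of such a numerical approach, the following Python sketch applies the discrete-time approximation (15.9)–(15.10) on a one-dimensional grid. The dynamics x˙ = u, the cost l(x, u) = 1, and the goal x = 0 are toy assumptions chosen only so that the computed G∗ can be checked against the obvious answer |x|.

```python
import numpy as np

# Minimal sketch of value iteration under the discrete-time approximation
# (15.9)-(15.10). Toy problem: dynamics x_dot = u, cost l(x, u) = 1, goal x = 0,
# so the exact minimum time-to-go is |x|.
dt, h = 0.05, 0.05                     # time step and grid spacing (u*dt lands on the grid)
xs = np.arange(-1.0, 1.0 + h / 2, h)   # discretized state space
us = [-1.0, 0.0, 1.0]                  # discretized action set Ud
goal = len(xs) // 2                    # index of the grid point x = 0
G = np.full(len(xs), 1e9)
G[goal] = 0.0                          # boundary condition at the goal

for _ in range(100):                   # value-iteration sweeps
    for i in range(len(xs)):
        if i == goal:
            continue
        # G*(x) = min_u { l(x,u) dt + G*(x + f(x,u) dt) }, as in (15.9)-(15.10).
        costs = []
        for u in us:
            j = min(max(i + int(round(u * dt / h)), 0), len(xs) - 1)
            costs.append(1.0 * dt + G[j])
        G[i] = min(costs)

print(np.max(np.abs(G - np.abs(xs))))  # should be (near) zero
```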

Variants of the HJB equation

Several versions of the HJB equation exist. The one presented in (15.14) is suitable for planning problems such as those expressed in Chapter 14. If the cost-to-go functions are time-dependent, then the HJB equation is

min_{u∈U(x)} { l(x, u, t) + ∂G∗/∂t + Σ_{i=1}^{n} (∂G∗/∂xi) fi(x, u, t) } = 0,  (15.16)

and G∗ is a function of both x and t. This can be derived again using a Taylor expansion, but with x and t treated as the variables. Most textbooks on optimal control theory present the HJB equation in this form or in a slightly different form by pulling ∂G∗/∂t outside of the min and moving it to the right of the equation:

min_{u∈U(x)} { l(x, u, t) + Σ_{i=1}^{n} (∂G∗/∂xi) fi(x, u, t) } = −∂G∗/∂t.  (15.17)

In differential game theory, the HJB equation generalizes to the Hamilton-Jacobi-Isaacs (HJI) equations [59, 477]. Suppose that the system is given as (13.203) and a zero-sum game is defined using a cost term of the form l(x, u, v, t). The HJI equations characterize saddle equilibria and are given as

min_{u∈U(x)} max_{v∈V(x)} { l(x, u, v, t) + ∂G∗/∂t + Σ_{i=1}^{n} (∂G∗/∂xi) fi(x, u, v, t) } = 0  (15.18)

and

max_{v∈V(x)} min_{u∈U(x)} { l(x, u, v, t) + ∂G∗/∂t + Σ_{i=1}^{n} (∂G∗/∂xi) fi(x, u, v, t) } = 0.  (15.19)

There are clear similarities between these equations and (15.16). Also, the swapping of the min and max operators resembles the definition of saddle points in Section 9.3.

15.2.2 Linear-Quadratic Problems

This section briefly describes a problem for which the HJB equation can be directly solved to yield a closed-form expression, as opposed to an algorithm that computes numerical approximations. Suppose that a linear system is given by (13.37), which requires specifying the matrices A and B. The task is to design a feedback plan that asymptotically stabilizes the system from any initial state. This is an infinite-horizon problem, and no termination action is applied.

An optimal solution is requested with respect to a cost functional based on matrix quadratic forms. Let Q be a nonnegative definite4 n × n matrix, and let R be a positive definite n × n matrix. The quadratic cost functional is defined as

L(x˜, u˜) = (1/2) ∫_{0}^{∞} ( x(t)^T Q x(t) + u(t)^T R u(t) ) dt.  (15.20)

To guarantee that a solution exists that yields finite cost, several assumptions must be made on the matrices. The pair (A, B) must be stabilizable, and (A, Q) must be detectable; see [28] for specific conditions and a full derivation of the solution presented here.

Although it is not done here, the HJB equation can be used to derive the algebraic Riccati equation,

SA + A^T S − SBR^{-1}B^T S + Q = 0,  (15.21)

in which all matrices except S were already given. Methods exist that solve for S, which is a unique solution in the space of nonnegative definite n × n matrices.

The linear vector field

x˙ = (A − BR^{-1}B^T S) x  (15.22)

is asymptotically stable (the real parts of all eigenvalues of the matrix are negative). This vector field is obtained if u is selected using a feedback plan π defined as

π(x) = −R^{-1}B^T S x.  (15.23)

The feedback plan π is in fact optimal, and the optimal cost-to-go is simply

G∗(x) = (1/2) x^T S x.  (15.24)

Thus, for linear systems with quadratic cost, an elegant solution exists without resorting to numerical approximations. Unfortunately, the solution techniques do not generalize to nonlinear systems or linear systems among obstacles. Hence, the planning methods of Chapter 14 are justified.
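As a concrete sketch of this recipe, the following Python fragment uses SciPy's continuous-time algebraic Riccati solver to construct the feedback plan (15.23) for a double-integrator plant; the particular A, B, Q, and R below are illustrative choices, not values taken from the text.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Minimal sketch of the linear-quadratic solution (15.21)-(15.24), assuming
# SciPy is available. The double-integrator matrices and weights are illustrative.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)            # nonnegative definite state cost
R = np.array([[1.0]])    # positive definite action cost

S = solve_continuous_are(A, B, Q, R)    # solves the algebraic Riccati equation (15.21)
K = np.linalg.solve(R, B.T @ S)         # so that pi(x) = -K x, as in (15.23)
print(np.linalg.eigvals(A - B @ K))     # real parts should all be negative, cf. (15.22)

x = np.array([1.0, -2.0])
print(0.5 * x @ S @ x)                  # optimal cost-to-go G*(x) from (15.24)
```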

However, many variations and extensions of the solutions given here do exist, but only for other problems that are expressed as linear systems with quadratic cost. In every case, some variant of Riccati equations is obtained by application of the HJB equation. Solutions to time-varying systems are derived in [28]. If there is Gaussian uncertainty in predictability, then the linear-quadratic Gaussian (LQG) problem is obtained [564]. Linear-quadratic problems and solutions even exist for differential games of the form (13.204) [59].

4Nonnegative definite means x^T Qx ≥ 0 for all x ∈ R^n, and positive definite means x^T Rx > 0 for all x ∈ R^n.

15.2.3 Pontryagin’s Minimum Principle

Pontryagin’s minimum principle5 is closely related to the HJB equation and provides conditions that an optimal trajectory must satisfy. Keep in mind, however, that the minimum principle provides necessary conditions, but not sufficient conditions, for optimality. In contrast, the HJB equation offered sufficient conditions. Using the minimum principle alone, one is often not able to conclude that a trajectory is optimal. In some cases, however, it is quite useful for finding candidate optimal trajectories. Any trajectory that fails to satisfy the minimum principle cannot be optimal.

To understand the minimum principle, we first return to the case of discrete planning. As mentioned previously, the minimum principle is essentially given by (15.7). This can be considered as a specialization of the HJB equation to the special case of applying the optimal action u∗. This causes the min to disappear, but along with it the global properties of the HJB equation also vanish. The minimum principle expresses conditions along the optimal trajectory, as opposed to the cost-to-go function over the whole state space. Therefore, it can at best assure local optimality in the space of possible trajectories.

The minimum principle for the continuous case is essentially given by (15.15), which is the continuous-time counterpart to (15.7). However, it is usually expressed in terms of adjoint variables and a Hamiltonian function, in the spirit of Hamiltonian mechanics from Section 13.4.4.

Let λ denote an n-dimensional vector of adjoint variables, which are defined as

λi = ∂G∗/∂xi.  (15.25)

The Hamiltonian function is defined as

H(x, u, λ) = l(x, u) + Σ_{i=1}^{n} λi fi(x, u),  (15.26)

which is exactly the expression inside of the min of the HJB equation (15.14) after using the adjoint variable definition from (15.25). This can be compared to the Hamiltonian given by (13.192) in Section 13.4.4 (p from that context becomes λ here). The two are not exactly the same, but they both are motivated by the same basic principles.

5This is often called Pontryagin’s maximum principle, because Pontryagin originally defined it as a maximization [801]. The Hamiltonian used in most control literature is negated with respect to Pontryagin’s Hamiltonian; therefore, it becomes minimized. Both names are in common use.

Under the execution of the optimal action trajectory u˜∗, the HJB equation implies that

H(x(t), u∗(t), λ(t)) = 0  (15.27)

for all t ≥ 0. This is just an alternative way to express (15.15). The fact that H remains constant appears very much like a conservation law, which was the basis of Hamiltonian mechanics in Section 13.4.4. The use of the Hamiltonian in the minimum principle is more general.

Using the HJB equation (15.14), the optimal action is given by

u∗(t) = argmin_{u∈U(x)} { H(x(t), u(t), λ(t)) } .  (15.28)

In other words, the Hamiltonian is minimized precisely at u(t) = u∗(t).

The missing piece of information so far is how λ evolves over time. It turns out that a system of the form

λ˙ = g(x, λ, u∗) (15.29)

can be derived by differentiating the Hamiltonian (or, equivalently, the HJB equation) with respect to x. This yields two coupled systems, x˙ = f(x, u∗) and (15.29). These can in fact be interpreted as a single system in a 2n-dimensional phase space, in which each phase vector is (x, λ). This is analogous to the phase interpretation in Section 13.4.4 for Hamiltonian mechanics, which results in (13.198).

Remember that λ is defined in (15.25) just to keep track of the change in G∗. It would be helpful to have an explicit form for (15.29). Suppose that u∗ is selected by a feedback plan to yield u∗ = π∗(x). In this case, the Hamiltonian can be interpreted as a function of only x and λ. Under this assumption, differentiating the Hamiltonian (15.26) with respect to xi yields

∂l(x, π∗(x))/∂xi + Σ_{j=1}^{n} (∂λj/∂xi) fj(x, π∗(x)) + Σ_{j=1}^{n} λj ∂fj(x, π∗(x))/∂xi.  (15.30)

The validity of this differentiation requires a technical lemma that asserts that the derivatives of π(x) can be disregarded (see Lemma 3.3.1 of [95]). Also, it will be assumed that U is convex in the arguments that follow, even though there exist proofs of the minimum principle that do not require this.

The second term in (15.30) is actually λ˙i, although it is hard to see at first.

The total differential of λi with respect to the state is

dλi = Σ_{j=1}^{n} (∂λi/∂xj) dxj.  (15.31)

Dividing both sides by dt yields

dλi/dt = Σ_{j=1}^{n} (∂λi/∂xj) (dxj/dt) = Σ_{j=1}^{n} (∂λi/∂xj) x˙j.  (15.32)

Each x˙j is given by the state transition equation: x˙j = fj(x, π∗(x)). Therefore,

λ˙i = dλi/dt = Σ_{j=1}^{n} (∂λi/∂xj) fj(x, π∗(x)).  (15.33)

Substituting (15.33) into (15.30) and setting the equation to zero (because the Hamiltonian is zero along the optimal trajectory) yields

∂l(x, π∗(x))/∂xi + λ˙i + Σ_{j=1}^{n} λj ∂fj(x, π∗(x))/∂xi = 0.  (15.34)

Solving for λ˙i yields

λ˙i = −∂l(x, π∗(x))/∂xi − Σ_{j=1}^{n} λj ∂fj(x, π∗(x))/∂xi.  (15.35)

Conveniently, this is the same as

λ˙i = −∂H/∂xi,  (15.36)

which yields the adjoint transition equation, as desired.

The transition equations given by x˙ = f(x, u) and (15.36) specify the evolution of the system given by the minimum principle. These are analogous to Hamilton’s equations (13.198), which were given in Section 13.4.4. The generalized momentum in that context becomes the adjoint variables here.
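A minimal sketch of this coupled evolution appears below, assuming the vector field f, the partial derivative ∂H/∂x, the cost l, and a finite set of candidate actions are supplied by the user; these names are placeholders introduced here for illustration.

```python
import numpy as np

# Minimal sketch of integrating the coupled system x_dot = f(x, u*) and
# lambda_dot = -dH/dx. The callables f(x, u), dHdx(x, lam, u), l(x, u) and the
# finite candidate set U_samples are user-supplied placeholders.
def integrate_state_adjoint(f, dHdx, l, U_samples, x0, lam0, dt=1e-3, T=1.0):
    x = np.asarray(x0, float).copy()
    lam = np.asarray(lam0, float).copy()
    traj = [(0.0, x.copy(), lam.copy())]
    for k in range(int(T / dt)):
        # Pointwise minimization of the Hamiltonian over sampled actions, as in (15.28).
        H = lambda u: l(x, u) + lam @ np.asarray(f(x, u))
        u_star = min(U_samples, key=H)
        x_new = x + dt * np.asarray(f(x, u_star))                # state transition equation
        lam_new = lam - dt * np.asarray(dHdx(x, lam, u_star))    # adjoint equation (15.36)
        x, lam = x_new, lam_new
        traj.append(((k + 1) * dt, x.copy(), lam.copy()))
    return traj
```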

When applying the minimum principle, it is usually required to use the fact that the optimal action at all times must satisfy (15.28). Often, this is equivalently expressed as

H(x(t), u∗(t), λ(t)) ≤ H(x(t), u(t), λ(t)), (15.37)

which indicates that the Hamiltonian increases or remains the same whenever deviation from the optimal action occurs (the Hamiltonian cannot decrease).

Example 15.4 (Optimal Planning for the Double Integrator) Recall the double integrator system from Example 13.3. Let q¨ = u, C = R, and U = [−1, 1] ∪ {uT}. Imagine a particle that moves in R. The action is a force in either direction and has at most unit magnitude. The state transition equation is x˙1 = x2 and x˙2 = u, and X = R^2. The task is to perform optimal motion planning between any two states xI, xG ∈ X. From a given initial state xI, a goal state xG must be reached in minimum time. The cost functional is defined in this case as l(x, u) = 1 for all x ∈ X and u ∈ U such that u ≠ uT.

Using (15.26), the Hamiltonian is defined as

H(x, u, λ) = 1 + λ1x2 + λ2u. (15.38)

The optimal action trajectory is obtained from (15.28) as

u∗(t) = argmin_{u∈[−1,1]} { 1 + λ1(t)x2(t) + λ2(t)u(t) } .  (15.39)

If λ2(t) < 0, then u∗(t) = 1, and if λ2(t) > 0, then u∗(t) = −1. Thus, the action may be assigned as u∗(t) = −sgn(λ2(t)) whenever λ2(t) ≠ 0. Note that these two cases are the “bangs” of the bang-bang control from Section 14.6.3, and they are also the extremal actions used for the planning algorithm in Section 14.4.1. In the boundary case in which λ2(t) = 0, any action in [−1, 1] may be chosen.

The only remaining task is to determine the values of the adjoint variables over time. The adjoint transition equation is obtained from (15.36) as λ˙1 = 0 and λ˙2 = −λ1. The solutions are λ1(t) = c1 and λ2(t) = c2 − c1t, in which c1 and c2 are constants that can be determined at t = 0 from (15.38) and (15.39). The optimal action depends only on the sign of λ2(t). Since its solution is the equation of a line, it can change signs at most once. Therefore, there are four possible kinds of solutions, depending on the particular xI and xG:

  1. Pure acceleration, u∗(t) = 1, is applied for all time.
  2. Pure deceleration, u∗(t) = −1, is applied for all time.
  3. Pure acceleration is applied up to some time t′ and is followed immediately by pure deceleration until the final time.
  4. Pure deceleration is applied up to some time t′ followed immediately by pure acceleration until the final time.

For the last two cases, t′ is often called the switching time, at which point a discontinuity in u˜∗ occurs. These two are bang-bang solutions, which were described in Section 14.6.3.
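To see the structure of these solutions numerically, the following Python sketch integrates the double integrator of Example 15.4 under u∗(t) = −sgn(λ2(t)); the constants c1 and c2 are arbitrary illustrative choices rather than values derived from particular boundary conditions xI and xG.

```python
import numpy as np

# Minimal sketch of Example 15.4: integrate the double integrator under the
# extremal action u*(t) = -sgn(lambda_2(t)). The constants c1 and c2 are
# arbitrary illustrative choices; in practice they are fixed by the boundary
# conditions.
c1, c2 = 1.0, 0.5
lam2 = lambda t: c2 - c1 * t            # adjoint solution, a line in t
dt, T = 1e-3, 1.0
x1, x2 = 0.0, 0.0                       # position and velocity

for k in range(int(T / dt)):
    t = k * dt
    u = -np.sign(lam2(t))               # bang-bang action (any u in [-1,1] if lam2 = 0)
    x1 += dt * x2                       # x1_dot = x2
    x2 += dt * u                        # x2_dot = u
print(x1, x2)  # state reached; the switch occurs at t' = c2/c1 = 0.5
```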

This was one of the simplest possible examples, and the optimal solution was easily found because the adjoint variables are linear functions of time. Section 15.3 covers optimal solutions for the Dubins car, the Reeds-Shepp car, and the differential drive, all of which can be established using the minimum principle combined with some geometric arguments. As systems become more complicated, such analysis is unfortunately too difficult. In these cases, sampling-based methods, such as those of Chapter 14, must be used to determine optimal trajectories.

One common complication is the existence of singular arcs along the solution trajectory. These correspond to a degeneracy in H with respect to u over some duration of time. This could be caused, for example, by having H independent of u. In Example 15.4, H became independent of u when λ2(t) = 0; however, there was no singular arc because this could only occur for an instant of time. If the duration had been longer, then there would be an interval of time over which the optimal action could not be determined. In general, if the Hessian (recall the definition from (8.48)) of H with respect to u is a positive definite matrix, then there are no singular arcs (this is often called the Legendre-Clebsch condition). The minimum principle in this case provides a sufficient condition for local optimality in the space of possible state trajectories. If the Hessian is not positive definite for some interval [t1, t2] with t1 < t2, then additional information is needed to determine the optimal trajectory over the singular arc from x∗(t1) to x∗(t2).
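The Legendre-Clebsch check can be carried out symbolically. The following SymPy sketch compares the time-optimal Hamiltonian (15.38) of Example 15.4, which is linear in u and therefore has a zero Hessian, with a made-up Hamiltonian that includes a quadratic action cost.

```python
import sympy as sp

# Minimal sketch of the Legendre-Clebsch check: examine the Hessian of H with
# respect to u. The second Hamiltonian is an illustrative assumption, not from the text.
x2, u, lam1, lam2 = sp.symbols("x2 u lam1 lam2")

H_time = 1 + lam1 * x2 + lam2 * u        # (15.38): linear in u,
print(sp.hessian(H_time, [u]))           # Matrix([[0]]) -> check is inconclusive

H_quad = u**2 + lam1 * x2 + lam2 * u     # made-up quadratic action cost
print(sp.hessian(H_quad, [u]))           # Matrix([[2]]) -> positive definite, no singular arcs
```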

Note that all of this analysis ignores the existence of obstacles. There is nothing to prevent the solutions from attempting to enter an obstacle region. The action set U(x) and cost l(x, u) can be adjusted to account for obstacles; however, determining an optimal solution from the minimum principle becomes virtually impossible, except in some special cases.

There are other ways to derive the minimum principle. Recall from Section 13.4.4 that Hamilton’s equations can be derived from the Euler-Lagrange equation. It should not be surprising that the minimum principle can also be derived using variational principles [95, 789]. The minimum principle can also be interpreted as a form of constrained optimization. This yields the interpretation of λ as Lagrange multipliers. A very illuminating reference for further study of the minimum principle is Pontryagin’s original works [801].

Time optimality Interesting interpretations of the minimum principle exist for the case of optimizing the time to reach the goal [424, 903]. In this case, l(x, u) = 1 in (15.26), and the cost term can be ignored. For the remaining portion, let λ be defined as

λi = −∂G∗/∂xi,  (15.40)

instead of using (15.25). In this case, the Hamiltonian can be expressed as

H(x, u, λ) = Σ_{i=1}^{n} λi fi(x, u) = ⟨−∂G∗/∂x, f(x, u)⟩,  (15.41)

which is an inner product between f(x, u) and the negative gradient of G∗. Using (15.40), the Hamiltonian should be maximized instead of minimized (this is equivalent to Pontryagin’s original formulation [801]). An inner product of two vectors increases as their directions become closer to parallel. Optimizing (15.41) amounts to selecting u so that x˙ is as close as possible to the direction of steepest descent of G∗. This is nicely interpreted by considering how the boundary of the reachable set R(x0, U, t) propagates through X. By definition, the points on ∂R(x0, U, t) must correspond to time-optimal trajectories. Furthermore, ∂R(x0, U, t) can be interpreted as a propagating wavefront that is perpendicular to −∂G∗/∂x. The minimum principle simply indicates that u should be chosen so that x˙ points into the propagating boundary, as close to being orthogonal as possible [424].
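A minimal sketch of this selection rule, assuming a numerical gradient of G∗ and a finite set of candidate actions are available (both introduced here purely for illustration):

```python
import numpy as np

# Minimal sketch of the rule implied by (15.41): pick the action whose velocity
# f(x, u) is most closely aligned with -dG*/dx. The callables f and grad_G and
# the finite set U_samples are user-supplied placeholders.
def time_optimal_action(x, f, grad_G, U_samples):
    neg_grad = -np.asarray(grad_G(x))
    return max(U_samples, key=lambda u: neg_grad @ np.asarray(f(x, u)))
```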
