Sequential Game Theory
So far in the chapter, the sequential decision-making process has only involved a game against nature. In this section, other decision makers are introduced to the game. The single-stage games and their equilibrium concepts from Sections 9.3 and 9.4 will be extended into a sequence of games. Section 10.5.1 introduces sequential zero-sum games that are represented using game trees, which help visualize the concepts. Section 10.5.2 covers sequential zero-sum games using the state-space representation. Section 10.5.3 briefly covers extensions to other games, including nonzero-sum games and games that involve nature. The formulations in this section will be called sequential game theory. Another common name for them is dynamic game theory [59]. If there is a continuum of stages, which is briefly considered in Section 13.5, then differential game theory is obtained [59, 477, 783, 985].
- Game Trees
In most literature, sequential games are formulated in terms of game trees. A state-space representation, which is more in alignment with the representations used in this chapter, will be presented in Section 10.5.2. The tree representation is commonly referred to as the extensive form of a game (as opposed to the normal form, which is the cost matrix representation used in Chapter 9). The representation is helpful for visualizing many issues in game theory. It is perhaps most helpful for visualizing information states; this aspect of game trees will be deferred until Section 11.7, after information spaces have been formally introduced. Here, game trees are presented for cases that are simple to describe without going deeply into information spaces.
Before a sequential game is introduced, consider representing a single-stage game in a tree form. Recall Example 9.14, which is a zero-sum, 3 × 3 matrix game. It can be represented as a game tree as shown in Figure 10.12. At the root, P1 has three choices. At the next level, P2 has three choices. Based on the choices by both, one of nine possible leaves will be reached. At this point, a cost is obtained, which is written under the leaf. The entries of the cost matrix, (9.53), appear across the leaves of the tree. Every nonleaf vertex is called a decision vertex: One player must select an action.
There are two possible interpretations of the game depicted in Figure 10.12:
- Before it makes its decision, P2 knows which action was applied by P1. This does not correspond to the zero-sum game formulation introduced in Section 9.3 because P2 seems as powerful as nature. In this case, it is not equivalent to the game in Example 9.14.
- P2 does not know the action applied by P1. This is equivalent to assuming that both P1 and P2 make their decisions at the same time, which is consistent with Formulation 9.7. The tree could have alternatively been represented with P2 acting first.
Now imagine that P1 and P2 play a sequence of games. A sequential version of the zero-sum game from Section 9.3 will be defined by extending the game tree idea given so far to more levels. This will model the following sequential game:
Formulation 10.3 (Zero-Sum Sequential Game in Tree Form)
- Two players, P1 and P2, take turns playing a game. A stage as considered previously is now stretched into two substages, in which each player acts individually. It is usually assumed that P1 always starts, followed by P2, then P1 again, and so on. Player alternations continue until the game ends. The model reflects the rules of many popular games such as chess or poker. Let K = {1, . . . , K} denote the set of stages at which P1 and P2 both take a turn.
- As each player takes a turn, it chooses from a nonempty, finite set of actions. The available set could depend on the decision vertex.
- At the end of the game, a cost for P1 is incurred based on the sequence of actions chosen by each player. The cost is interpreted as a reward for P2.
- The amount of information that each player has when making its decision must be specified. This is usually expressed by indicating what portions of the action histories are known. For example, if P1 just acted, does P2 know its choice? Does it know what action P1 chose in some previous stage?
The game tree can now be described in detail. Figure 10.13 shows a particular example for two stages (hence, K = 2 and K = {1, 2}). Every vertex corresponds to a point at which a decision needs to be made by one player. Each edge emanating from a vertex represents an action. The root of the tree indicates the beginning of the game, which usually means that P1 chooses an action. The leaves of the tree represent the end of the game, which are the points at which a cost is received. The cost is usually shown below each leaf. One final concern is to specify the information available to each player, just prior to its decision. Which actions among those previously applied by itself or other players are known?
For the game tree in Figure 10.13, there are two players and two stages. There- fore, there are four levels of decision vertices. The action sets for the players are
U = V = {L, R}, for “left” and “right.” Since there are always two actions, a
[Figure: a four-level binary tree in which P1 and P2 alternate turns; the 16 leaf costs, from left to right, are 4, 2, 0, 0, 1, 0, 3, 2, 2, 3, 1, 2, 4, 1, 3, 2.]

Figure 10.13: A two-player, two-stage game expressed using a game tree.
binary tree is obtained. There are 16 possible outcomes, which correspond to all pairwise combinations of four possible two-stage plans for each player.
For a single-stage game, both deterministic and randomized strategies were defined to obtain saddle points. Recall from Section 9.3.3 that randomized strategies were needed to guarantee the existence of a saddle point. For a sequential game, these are extended to deterministic and randomized plans, respectively. In Section 10.1.3, a (deterministic) plan was defined as a mapping from the state space to an action space. This definition can be applied here for each player; however, we must determine what is a "state" for the game tree. This depends on the information that each player has available when it plays.
A general framework for representing information in game trees is covered in Section 11.7. Three simple kinds of information will be discussed here. In every case, each player knows its own actions that were applied in previous stages. The differences correspond to knowledge of actions applied by the other player. These define the “state” that is used to make the decisions in a plan.
The three information models considered here are as follows.
Alternating play: The players take turns playing, and all players know all actions that have been previously applied. This is the situation obtained, for example, in a game of chess. To define a plan, let N1 and N2 denote the set of all vertices from which P1 and P2 must make a decision, respectively. In Figure 10.13, N1 is the set of dark vertices and N2 is the set of white vertices. Let U(n1) and V (n2) be the action spaces for P1 and P2, respectively, which depend on the vertex. A (deterministic) plan for P1 is defined as a function, π1, on N1 that yields an action u ∈ U(n1) for each n1 ∈ N1. Similarly, a (deterministic) plan for P2 is defined as a function, π2, on N2 that yields an action v ∈ V (n2) for each n2 ∈ N2. For the randomized case, let W (n1) and Z(n2) denote the sets of all probability distributions over U(n1) and V (n2), respectively. A randomized plan for P1 is defined as a function that yields some w ∈ W (n1) for each n1 ∈ N1. Likewise, a randomized plan for P2 is defined as a function that maps from N2 into Z(n2).
Stage-by-stage: Each player knows the actions applied by the other in all previous stages; however, there is no information about actions chosen by others in the current stage. This effectively means that both players act simultaneously in each stage. In this case, a deterministic or randomized plan for P1 is defined as in the alternating play case; however, plans for P2 are defined as functions on N1, instead of N2. This is because at the time it makes its decision, P2 has available precisely the same information as P1. The action spaces for P2 must conform to be dependent on elements of N1, instead of N2; otherwise, P2 would not know what actions are available. Therefore, they are defined as V (n1) for each n1 ∈ N1.
Open loop: Each player has no knowledge of the previous actions of the other. They only know how many actions have been applied so far, which indicates the stage of the game. Plans are defined as functions on K, the set of stages, because the particular vertex is not known. Note that an open-loop plan is just a sequence of actions in the deterministic case (as in Section 2.3) and a sequence of probability distributions in the randomized case. Again, the action spaces must conform to the information. Thus, they are U(k) and V (k) for each k ∈ K.
For a single-stage game, as in Figure 10.12, the stage-by-stage and open-loop models are equivalent.
- Determining a security plan
The notion of a security strategy from Section 9.3.2 extends in a natural way to sequential games. This yields a security plan in which each player performs worst-case analysis by treating the other player as nature under nondeterministic uncertainty. A security plan and its resulting cost can be computed by propagating costs from the leaves up to the root. The computation of the security plan for P1 for the game in Figure 10.13 is shown in Figure 10.14. The actions that would be chosen by P2 are determined at all vertices in the second-to-last level of the tree. Since P2 tries to maximize costs, the recorded cost at each of these vertices is the maximum over the costs of its children. At the next higher level, the actions that would be chosen by P1 are determined. At each vertex, the minimum cost among its children is recorded. In the next level, P2 is considered, and so on, until the root is reached. At this point, the lowest cost that P1 could secure is known. This yields the upper value, L̄∗, for the sequential game. The security plan is defined by providing the action that selects the lowest cost child vertex, for each n1 ∈ N1. If P2 responds rationally to the security plan of P1, then the path shown in bold in Figure 10.14d will be followed. The execution of P1's security plan yields the action sequence (L, L) for P1 and (R, L) for P2. The upper value is L̄∗ = 1.
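The backward cost propagation of Figure 10.14 can be sketched in a few lines of code. The following is a minimal sketch (the function names `secure` and `build` are invented for illustration): the tree of Figure 10.13 is encoded as nested pairs with the 16 leaf costs, and costs are propagated upward with P1 minimizing and P2 maximizing at alternating levels.

```python
# Compute P1's security value (the upper value) for the game tree of
# Figure 10.13 by propagating costs from the leaves up to the root.
# Internal vertices are (left_subtree, right_subtree) pairs; leaves are costs.

def secure(tree, p1_turn=True):
    """Return the cost that P1 can secure from this vertex.

    P1 (the minimizer) acts at even levels, P2 (the maximizer) at odd levels.
    """
    if not isinstance(tree, tuple):          # leaf: the cost is received
        return tree
    child_costs = [secure(c, not p1_turn) for c in tree]
    return min(child_costs) if p1_turn else max(child_costs)

def build(costs):
    """Build a balanced binary tree from a flat list of leaf costs."""
    if len(costs) == 1:
        return costs[0]
    mid = len(costs) // 2
    return (build(costs[:mid]), build(costs[mid:]))

# Leaf costs of Figure 10.13, read from left to right.
leaves = [4, 2, 0, 0, 1, 0, 3, 2, 2, 3, 1, 2, 4, 1, 3, 2]
tree = build(leaves)
print(secure(tree))   # upper value of the game: 1
```

Running `secure` on each subtree reproduces the intermediate values shown in Figure 10.14: the maxima 4, 0, 1, 3, 3, 2, 4, 3 at P2's last level, then 0, 1, 2, 3 one level up, and finally the upper value 1 at the root.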
A security plan for P2 can be computed similarly; however, the order of the decisions must be swapped. This is not easy to visualize, unless the order of the players is swapped in the tree. If P2 acts first, then the resulting tree is as shown in Figure 10.15. The costs on the leaves appear in different order; however, for
[Figure: four panels showing the upward cost propagation on the tree of Figure 10.13; (a) P2 chooses, (b) P1 chooses, (c) P2 chooses, (d) P1 chooses. The costs recorded at P2's last decision level are 4, 0, 1, 3, 3, 2, 4, 3.]
Figure 10.14: The security plan for P1 is determined by propagating costs upward from the leaves. The choices involved in the security plan are shown in the last picture. An upper value of 1 is obtained for the game.
the same action sequences chosen by P1 and P2, the costs obtained at the end of the game are the same as those in Figure 10.14. The resulting lower value for the game is found to be L̲∗ = 1. The resulting security plan is defined by assigning to each n2 ∈ N2 the action that maximizes the cost value of its children. If P1 responds rationally to the security plan of P2, then the actions executed will be (L, L) for P1 and (R, L) for P2. Note that these are the same as those obtained from executing the security plan of P1, even though they appear different in the trees because the player order was swapped. In many cases, however, different action sequences will be obtained.
As in the case of a single-stage game, L̄∗ = L̲∗ implies that the game has a deterministic saddle point, and the value of the sequential game is L∗ = L̄∗ = L̲∗. This particular game has a unique, deterministic saddle point. This yields predictable, identical choices for the players, even though they perform separate, worst-case analyses.
A substantial reduction in the cost of computing the security strategies can be obtained by recognizing when certain parts of the tree do not need to be explored because they cannot yield improved costs. This idea is referred to as alpha-beta pruning in AI literature (see [839], pp. 186-187 for references and a brief history). Suppose that the tree is searched in depth-first order to determine the security strategy for P1. At some decision vertex for P1, suppose it has been determined that a cost c would be secured if a particular action, u, is applied; however, there
[Figure: the game tree with P2 acting first; the leaf costs of Figure 10.13 appear in reshuffled order.]
Figure 10.15: The security plan can be found for P2 by swapping the order of P1 and P2 (the order of the costs on the leaves also becomes reshuffled).
Figure 10.16: If the tree is explored in depth-first order, there are situations in which some children (and hence whole subtrees) do not need to be explored. This is an example that eliminates children of P2. Another case exists, which eliminates children of P1.
are still other actions for which it is not known what costs could be secured. Consider determining the cost that could be secured for one of these remaining actions, denoted by u′. This requires computing how P2 will maximize cost to respond to u′. As soon as P2 has at least one option for which the cost, c′, is greater than c, the other children of P2 do not need to be explored. Why? This is because P1 would never choose u′ if P2 could respond in a way that leads to a higher cost than what P1 can already secure by choosing u. Figure 10.16 shows a simple example. This situation can occur at any level in the tree, and when an action does not need to be considered, an entire subtree is eliminated. In other situations, children of P1 can be eliminated because P2 would not make a choice that allows P1 to improve the cost below a value that P2 can already secure for itself.
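The pruning rule just described can be sketched as a standard alpha-beta recursion. This is a minimal sketch, not the exact algorithm from [839]; here `alpha` tracks the best (highest) cost P2 has secured on the current path and `beta` the best (lowest) cost P1 has secured, and a subtree is skipped as soon as it cannot change the result.

```python
# Alpha-beta pruning on a two-player, zero-sum game tree in which P1
# minimizes and P2 maximizes.  Subtrees that cannot affect the security
# value are never explored.

def alphabeta(tree, p1_turn=True, alpha=float('-inf'), beta=float('inf')):
    if not isinstance(tree, tuple):   # leaf: return its cost
        return tree
    if p1_turn:                       # P1 minimizes over its children
        value = float('inf')
        for child in tree:
            value = min(value, alphabeta(child, False, alpha, beta))
            beta = min(beta, value)
            if value <= alpha:        # P2 already secures alpha elsewhere,
                break                 # so prune the remaining children
        return value
    else:                             # P2 maximizes over its children
        value = float('-inf')
        for child in tree:
            value = max(value, alphabeta(child, True, alpha, beta))
            alpha = max(alpha, value)
            if value >= beta:         # P1 already secures beta elsewhere
                break
        return value

# The tree of Figure 10.13 as nested pairs with its 16 leaf costs.
tree = ((((4, 2), (0, 0)), ((1, 0), (3, 2))),
        (((2, 3), (1, 2)), ((4, 1), (3, 2))))
print(alphabeta(tree))   # 1, the same upper value, with fewer leaves visited
```

Pruning never changes the value obtained at the root; it only avoids exploring subtrees whose outcome is already determined by costs secured elsewhere.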
[Figure: P1 acts (L or R), then P2 acts (L or R), leading to one of four matrix games:

| 4 | 2 |   | 1 | 0 |   | 2 | 3 |   | 4 | 1 |
| 0 | 0 |   | 3 | 2 |   | 1 | 2 |   | 3 | 2 | ]
Figure 10.17: Under the stage-by-stage model, the game in Figure 10.13 can instead be represented as a tree in which each player acts once, and then they play a matrix game to determine the cost.
- Computing a saddle point
The security plan for P1 constitutes a valid solution to the game under the alternating play model. P2 has only to choose an optimal response to the plan of P1 at each stage. Under the stage-by-stage model, the "solution" concept is a saddle point, which occurs when the upper and lower values coincide. The procedure just described could be used to determine the value and corresponding plans; however, what happens when the values do not coincide? In this case, randomized security plans should be developed for the players. As in the case of a single-stage game, a randomized upper value L̄∗ and a randomized lower value L̲∗ are obtained. In the space of randomized plans, it turns out that a saddle point always exists. This implies that the game always has a randomized value, L∗ = L̄∗ = L̲∗. This saddle point can be computed from the bottom up, in a manner similar to the method just used to compute security plans.
Return to the example in Figure 10.13. This game actually has a deterministic saddle point, as indicated previously. It still, however, serves as a useful illustration of the method because any deterministic plan can once again be interpreted as a special case of a randomized plan (all of the probability mass is placed on a single action). Consider the bottom four subtrees of Figure 10.13, which are obtained by using only the last two levels of decision vertices. In each case, P1 and P2 must act in parallel to end the sequential game. Each subtree can be considered as a matrix game because the costs are immediately obtained after the two players act. This leads to an alternative way to depict the game in Figure 10.13, which is shown in Figure 10.17. The bottom two layers of decision vertices are replaced by matrix games. Now compute the randomized value for each game and place it at the corresponding leaf vertex, as shown in Figure 10.18. In the example, there are only two layers of decision vertices remaining. This can be represented as the game

       V
  U  | 0 | 1 |
     | 2 | 3 | ,    (10.105)
[Figure: P1 acts (L or R), then P2 acts (L or R); the four leaf values are 0, 1, 2, 3.]
Figure 10.18: Each matrix in Figure 10.17 can be replaced by its randomized value. This clips one level from the original tree. For this particular example, the randomized value is also a deterministic value. Note that these are exactly the costs that appeared in Figure 10.14c. This occurred because each of the matrix games has a deterministic value; if they do not, then the costs will not coincide.
which has a value of 1 and occurs if P1 applies L and P2 applies R. Thus, the solution to the original sequential game has been determined by solving matrix games as an alternative to the method applied to obtain the security plans. The benefit of the new method is that if any matrix does not have a deterministic saddle point, its randomized value can instead be computed. A randomized strategy must be played by the players if the corresponding decision vertex is reached during execution.
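The bottom-up computation can be sketched directly from the deterministic upper- and lower-value definitions of Section 9.3 (the helper names `upper_value` and `lower_value` are invented for illustration). Each subgame of Figure 10.17 is replaced by its value, and the reduced game (10.105) is then solved the same way.

```python
# Deterministic upper and lower values of a zero-sum matrix game in which
# P1 (rows) minimizes and P2 (columns) maximizes, as in Section 9.3.

def upper_value(A):   # P1 secures: min over rows of the worst (max) column
    return min(max(row) for row in A)

def lower_value(A):   # P2 secures: max over columns of the best (min) row
    return max(min(row[j] for row in A) for j in range(len(A[0])))

# The four matrix games at the bottom of Figure 10.17.
subgames = [[[4, 2], [0, 0]],
            [[1, 0], [3, 2]],
            [[2, 3], [1, 2]],
            [[4, 1], [3, 2]]]

# Each subgame happens to have a deterministic saddle point here, so its
# value simply replaces it; otherwise the randomized value would be used.
assert all(upper_value(A) == lower_value(A) for A in subgames)
values = [upper_value(A) for A in subgames]
print(values)                      # [0, 1, 2, 3], as in Figure 10.18

# The reduced game (10.105) is then solved the same way.
reduced = [[values[0], values[1]], [values[2], values[3]]]
print(upper_value(reduced))        # value of the sequential game: 1
```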
- Converting the tree to a single-stage game
Up to this point, solutions have been determined for the alternating-play and the stage-by-stage models. The open-loop model remains. In this case, there is no exchange of information between the players until the game is finished and they receive their costs. Therefore, imagine that players engaged in such a sequential game are equivalently engaged in a large, single-stage game. Recall that a plan
under the open-loop model is a function over K. Let Π1 and Π2 represent the
sets of possible plans for P1 and P2, respectively. For the game in Figure 10.13, Πi is a set of four possible plans for each player, which will be specified in the following order: (L, L), (L, R), (R, L), and (R, R). These can be arranged into a
4 × 4 matrix game:
       Π2
  Π1 | 4 | 2 | 1 | 0 |
     | 0 | 0 | 3 | 2 |
     | 2 | 3 | 4 | 1 |
     | 1 | 2 | 3 | 2 | .    (10.106)
This matrix game does not have a deterministic saddle point. Unfortunately, a four-dimensional linear programming problem must be solved to find the ran- domized value and equilibrium. This is substantially different than the solution obtained for the other two information models.
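The matrix (10.106) can be generated mechanically by enumerating open-loop plans. The following sketch (the `cost` helper and binary path encoding are illustrative choices, not taken from the text) pairs each plan of P1 with each plan of P2 and reads off the leaf of Figure 10.13 reached by the resulting path.

```python
# Build the 4x4 open-loop matrix game (10.106) from the leaf costs of
# Figure 10.13.  An open-loop plan is just an action sequence; the plan
# pair ((u1, u2), (v1, v2)) selects the leaf reached by the path
# u1, v1, u2, v2 through the four-level binary tree.

leaves = [4, 2, 0, 0, 1, 0, 3, 2, 2, 3, 1, 2, 4, 1, 3, 2]
L, R = 0, 1
plans = [(L, L), (L, R), (R, L), (R, R)]   # the ordering used in the text

def cost(p1_plan, p2_plan):
    u1, u2 = p1_plan
    v1, v2 = p2_plan
    # Treat the path u1, v1, u2, v2 as a 4-bit index into the leaf list.
    return leaves[8 * u1 + 4 * v1 + 2 * u2 + v2]

matrix = [[cost(p1, p2) for p2 in plans] for p1 in plans]
for row in matrix:
    print(row)
# [4, 2, 1, 0]
# [0, 0, 3, 2]
# [2, 3, 4, 1]
# [1, 2, 3, 2]
```

The printed rows match (10.106), confirming that the single-stage game over plan spaces encodes exactly the same outcomes as the tree.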
The matrix-game form can also be derived for sequential games defined under the stage-by-stage model. In this case, however, the space of plans is even larger.
For the example in Figure 10.13, there are 32 possible plans for each player (there are 5 decision vertices for each player, at each of which two different actions can be applied; hence, |Πi| = 2^5 = 32 for i = 1 and i = 2). This results in a 32 × 32 matrix game! This game should admit the same saddle point solution that we already determined. The advantage of using the tree representation is that this enormous game was decomposed into many tiny matrix games. By treating the problem stage-by-stage, substantial savings in computation results. This power arises because the dynamic programming principle was implicitly used in the tree-based computation method of decomposing the sequential game into small matrix games. The connection to previous dynamic programming methods will be made clearer in the next section, which considers sequential games that are defined over a state space.
- Sequential Games on State Spaces
An apparent problem in the previous section is that the number of vertices grows exponentially in the number of stages. In some games, however, there may be multiple action sequences that lead to the same state. This is true of many popular games, such as chess, checkers, and tic-tac-toe. In this case, it is convenient to define a state space that captures the complete set of unique game configurations. The player actions then transform the state. If there are different action sequences that lead to the same state, then separate vertices are not needed. This converts the game tree into a game graph by declaring vertices that represent the same state to be equivalent. The game graph is similar in many ways to the transition graphs discussed in Section 10.1, in the sequential game against nature. The same idea can be applied when there are opposing players.
We will arrive at a sequential game that is played over a state space by collapsing the game tree into a game graph. We will also allow the more general case of costs occurring on any transition edges, as opposed to only the leaves of the original game tree. Only the stage-by-stage model from the game tree is generalized here. Generalizations that use other information models are considered in Section 11.7. In the formulation that follows, P2 can be viewed as the replacement for nature in Formulation 10.1. The new formulation is still a generalization of Formulation 9.7, which was a single-stage, zero-sum game. To keep the concepts simpler, all spaces are assumed to be finite. The formulation is as follows.
Formulation 10.4 (Sequential Zero-Sum Game on a State Space)
- Two players, P1 and P2.
- A finite, nonempty state space X.
- For each state x ∈ X, a finite, nonempty action space U(x) for P1.
- For each state x ∈ X, a finite, nonempty action space V (x) for P2. To allow an extension of the alternating play model from Section 10.5.1, V (x, u) could alternatively be defined, to enable the set of actions available to P2 to depend on the action u ∈ U of P1.
- A state transition function f that produces a state, f(x, u, v), for every
x ∈ X, u ∈ U(x), and v ∈ V (x).
- A set K = {1, . . . , K} of K stages, each denoted by k, which begins at k = 1 and ends at k = K. Let F = K + 1, which is the final stage, after the last action is applied.
- An initial state xI ∈ X. For some problems, this may not be specified, in which case a solution must be found from all initial states.
- A stage-additive cost functional L. Let v˜K denote the history of P2’s actions up to stage K. The cost functional may be applied to any combination of state and action histories to yield
L(x̃F , ũK , ṽK ) = Σ_{k=1}^{K} l(xk, uk, vk) + lF (xF ).    (10.107)
It will be assumed that both players always know the current state. Note that there are no termination actions in the formulation. The game terminates after each player has acted K times. There is also no direct formulation of a goal set. Both termination actions and goal sets can be added to the formulation without difficulty, but this is not considered here. The action sets can easily be extended to allow a dependency on the stage, to yield U(x, k) and V (x, k). The methods presented in this section can be adapted without trouble. This is avoided, however, to make the notation simpler.
Defining a plan for each player Each player must now have its own plan. As in Section 10.1, it seems best to define a plan as a mapping from states to actions, because it may not be clear what actions will be taken by the other decision maker. In Section 10.1, the other decision maker was nature, and here it is a rational opponent. Let π1 and π2 denote plans for P1 and P2, respectively. Since the number of stages in Formulation 10.4 is fixed, stage-dependent plans of the form π1 : X × K → U and π2 : X × K → V are appropriate (recall that stage-dependent plans were defined in Section 10.1.3). Each produces an action π1(x, k) ∈ U(x) and π2(x, k) ∈ V (x), respectively.
Now consider different solution concepts for Formulation 10.4. For P1, a deterministic plan is a function π1 : X × K → U that produces an action u = π1(x, k) ∈ U(x) for each state x ∈ X and stage k ∈ K. For P2 it is instead π2 : X × K → V , which produces an action v = π2(x, k) ∈ V (x) for each x ∈ X and k ∈ K. Now consider defining a randomized plan. Let W (x) and Z(x) denote the sets of all probability distributions over U(x) and V (x), respectively. A randomized plan for P1 yields some w ∈ W (x) for each x ∈ X and k ∈ K. Likewise, a randomized plan for P2 yields some z ∈ Z(x) for each x ∈ X and k ∈ K.
Saddle points in a sequential game A saddle point will be obtained once again by defining security strategies for each player. Each player treats the other as nature, and if the same worst-case value is obtained, then the result is a saddle point for the game. If the values are different, then a randomized plan is needed to close the gap between the upper and lower values.
Upper and lower values now depend on the initial state, x1 ∈ X. There was no equivalent for this in Section 10.5.1 because the root of the game tree is the only possible starting point.
If sequences, u˜K and v˜K , of actions are applied from x1, then the state history, x˜F , can be derived by repeatedly using the state transition function, f. The upper value from x1 is defined as
L̄∗(x1) = min_{u1} max_{v1} min_{u2} max_{v2} · · · min_{uK} max_{vK} { L(x̃F , ũK , ṽK ) },    (10.108)
which is identical to (10.33) if P2 is replaced by nature. Also, (10.108) generalizes (9.44) to multiple stages. The lower value from x1, which generalizes (9.46), is
L̲∗(x1) = max_{v1} min_{u1} max_{v2} min_{u2} · · · max_{vK} min_{uK} { L(x̃F , ũK , ṽK ) }.    (10.109)
If L̄∗(x1) = L̲∗(x1), then a deterministic saddle point exists from x1. This implies that the order of max and min can be swapped inside of every stage.
Value iteration A value-iteration method can be derived by adapting the deriva- tion that was applied to (10.33) to instead apply to (10.108). This leads to the dynamic programming recurrence
L̄∗k(xk) = min_{uk ∈ U(xk)} max_{vk ∈ V(xk)} { l(xk, uk, vk) + L̄∗k+1(xk+1) },    (10.110)

which is analogous to (10.39). This can be used to iteratively compute a security plan for P1. The security plan for P2 can be computed using

L̲∗k(xk) = max_{vk ∈ V(xk)} min_{uk ∈ U(xk)} { l(xk, uk, vk) + L̲∗k+1(xk+1) },    (10.111)
which is the dynamic programming equation derived from (10.109).
Starting from the final stage, F, the upper and lower values are determined directly from the cost function:
L̄∗F (xF ) = L̲∗F (xF ) = lF (xF ).    (10.112)
Now compute L̄∗K and L̲∗K . From every state, xK , (10.110) and (10.111) are evaluated to determine whether L̄∗K (xK ) = L̲∗K (xK ). If this occurs, then L∗K (xK ) = L̄∗K (xK ) = L̲∗K (xK ) is the value of the game from xK at stage K. If it is determined that from any particular state, xK ∈ X, the upper and lower values are not equal, then there is no deterministic saddle point from xK . Furthermore, this will prevent the existence of deterministic saddle points from other states at earlier stages; these are encountered in later value iterations. Such problems are avoided by allowing randomized plans, but the optimization is more complicated because linear programming is repeatedly involved.
Suppose for now that L̄∗K (xK ) = L̲∗K (xK ) for all xK ∈ X. The value iterations proceed in the usual way from k = K down to k = 1. Again, suppose that at every stage, L̄∗k(xk) = L̲∗k(xk) for all xk ∈ X. Note that L∗k+1 can be written in the place of L̄∗k+1 and L̲∗k+1 in (10.110) and (10.111) because it is assumed that the upper and lower values coincide. If they do not, then the method fails because randomized plans are needed to obtain a randomized saddle point.
Once the resulting values are computed from each x1 ∈ X, a security plan π1∗ for P1 is defined for each k ∈ K and xk ∈ X as any action u that satisfies the min in (10.110). A security plan π2∗ is similarly defined for P2 by applying any action v that satisfies the max in (10.111).
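The value iterations (10.110)–(10.112) can be sketched on a toy instance of Formulation 10.4. All numbers and function choices below are invented for illustration; the example is chosen so that a deterministic saddle point happens to exist from every state.

```python
# Value iteration for a toy instance of Formulation 10.4.
# P1 minimizes and P2 maximizes; X = {0, 1}, U(x) = V(x) = {0, 1}, K = 2.

X, U, V, K = [0, 1], [0, 1], [0, 1], 2

def f(x, u, v):            # state transition function (illustrative)
    return (x + u + v) % 2

def l(x, u, v):            # stage cost (illustrative)
    return u - v

def lF(x):                 # final cost (illustrative)
    return x

# Terminal condition (10.112): upper and lower values both equal lF.
upper = {x: lF(x) for x in X}
lower = {x: lF(x) for x in X}

for k in range(K, 0, -1):                 # stages K down to 1
    new_upper, new_lower = {}, {}
    for x in X:
        # (10.110): min over u of the max over v
        new_upper[x] = min(max(l(x, u, v) + upper[f(x, u, v)] for v in V)
                           for u in U)
        # (10.111): max over v of the min over u
        new_lower[x] = max(min(l(x, u, v) + lower[f(x, u, v)] for u in U)
                           for v in V)
    upper, lower = new_upper, new_lower

# A deterministic saddle point exists from every state: the values agree.
print(upper == lower)      # True
print(upper)               # {0: 0, 1: 1}
```

If the two dictionaries disagreed at some state, the deterministic recurrences would have to be replaced by the randomized versions (10.113) and (10.114) from that state onward.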
Now suppose that there exists no deterministic saddle point from one or more initial states. To avoid regret, randomized security plans must be developed. These follow by direct extension of the randomized security strategies from Section 9.3.3. The vectors w and z will be used here to denote probability distributions over U(x) and V (x), respectively. The probability vectors are selected from W (x) and Z(x), which correspond to the set of all probability distributions over U(x) and V (x), respectively. For notational convenience, assume U(x) = {1, . . . , m(x)} and V (x) = {1, . . . , n(x)}, in which m(x) and n(x) are positive integers.
Recall (9.61) and (9.62), which defined the randomized upper and lower values of a single-stage game. This idea is generalized here to the randomized upper and lower values of a sequential game. Their definitions are similar to (10.108) and (10.109), except that: 1) the alternating min's and max's are taken over probability distributions on the space of actions, and 2) the expected cost is used.
The dynamic programming principle can be applied to the randomized upper value to derive
L̄∗k(xk) = min_{w ∈ W(xk)} max_{z ∈ Z(xk)} { Σ_{i=1}^{m(xk)} Σ_{j=1}^{n(xk)} ( l(xk, i, j) + L̄∗k+1(xk+1) ) wi zj },    (10.113)
in which xk+1 = f(xk, i, j). The randomized lower value is similarly obtained as
L̲∗k(xk) = max_{z ∈ Z(xk)} min_{w ∈ W(xk)} { Σ_{i=1}^{m(xk)} Σ_{j=1}^{n(xk)} ( l(xk, i, j) + L̲∗k+1(xk+1) ) wi zj }.    (10.114)
In many games, the cost term may depend only on the state: l(x, u, v) = l(x) for all x ∈ X, u ∈ U(x), and v ∈ V (x). In this case, (10.113) and (10.114) simplify to
L̄∗k(xk) = min_{w ∈ W(xk)} max_{z ∈ Z(xk)} { l(xk) + Σ_{i=1}^{m(xk)} Σ_{j=1}^{n(xk)} L̄∗k+1(xk+1) wi zj }    (10.115)
and
L̲∗k(xk) = max_{z ∈ Z(xk)} min_{w ∈ W(xk)} { l(xk) + Σ_{i=1}^{m(xk)} Σ_{j=1}^{n(xk)} L̲∗k+1(xk+1) wi zj },    (10.116)
which is similar to the simplification obtained in (10.46), in which θk was assumed not to appear in the cost term. The summations are essentially generalizations of (9.57) to the multiple-stage case. If desired, these could even be written as matrix multiplications, as was done in Section 9.3.3.
Value iteration can be performed over the equations above to obtain the ran- domized values of the sequential game. Since the upper and lower values are always the same, there is no need to check for discrepancies between the two. In practice, it is best in every evaluation of (10.113) and (10.114) (or their simpler forms) to first check whether a deterministic saddle exists from xk. Whenever one does not exist, the linear programming problem formulated in Section 9.3.3 must be solved to determine the value and the best randomized plan for each player. This can be avoided if a deterministic saddle exists from the current state and stage.
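The check for a deterministic saddle, with a fall-back to a randomized value, can be illustrated for 2 × 2 games, where the fully mixed value has a closed form; larger games require the linear program of Section 9.3.3. The helper below is a sketch under that 2 × 2 assumption (the function name and the closed-form shortcut are illustrative, not from the text).

```python
# Randomized value of a 2x2 zero-sum matrix game (P1 rows, minimizing;
# P2 columns, maximizing).  A deterministic saddle is used when it exists;
# otherwise both players fully mix and the 2x2 closed form applies.

def randomized_value(A):
    (a, b), (c, d) = A
    upper = min(max(a, b), max(c, d))   # deterministic upper value
    lower = max(min(a, c), min(b, d))   # deterministic lower value
    if upper == lower:
        return upper                    # deterministic saddle point exists
    # No deterministic saddle: equalize P1's expected cost across columns.
    return (a * d - b * c) / (a + d - b - c)

print(randomized_value([[0, 1], [2, 3]]))     # 1 (deterministic saddle)
print(randomized_value([[1, -1], [-1, 1]]))   # 0.0 (matching pennies)
```

In a value-iteration implementation, this routine would be called once per state and stage on the matrix of one-step costs plus values from the next stage.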
- Other Sequential Games
Most of the ideas presented so far in Section 10.5 extend naturally to other se- quential game problems. This subsection briefly mentions some of these possible extensions.
Nash equilibria in sequential games Formulations 10.3 and 10.4 can be ex- tended to sequential nonzero-sum games. In the case of game trees, a cost vector, with one element for each player, is written at each of the leaves. Under the stage-by-stage model, deterministic and randomized Nash equilibria can be com- puted using the bottom-up technique that was presented in Section 10.5.1. This will result in the computation of a single Nash equilibrium. To represent all Nash equilibria is considerably more challenging. As usual, the game tree is decomposed into many matrix games; however, in each case, all Nash equilibria must be found and recorded along with their corresponding costs. Instead of propagating a sin- gle cost up the tree, a set of cost vectors, along with the actions associated with each cost vector, must be propagated up the tree to the root. As in the case of a single-stage game, nonadmissible Nash equilibria can be removed from consid- eration. Thus, from every matrix game encountered in the computation, only the admissible Nash equilibria and their costs should be propagated upward.
Formulation 10.4 can be extended by introducing the cost functions L1 and L2 for P1 and P2, respectively. The value-iteration approach can be extended in a way similar to the extension of the game tree method. Multiple value vectors and their corresponding actions must be maintained for each combination of state and stage. These correspond to the admissible Nash equilibria.
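The computation performed at each decision vertex in this bottom-up method can be sketched as follows: find all deterministic Nash equilibria of a bimatrix game, then keep only the admissible ones. This is a minimal sketch (the helper names and the example matrices are ours); both players minimize their own cost.

```python
import numpy as np
from itertools import product

def pure_nash(L1, L2):
    """All deterministic Nash equilibria of a bimatrix game in which
    P1 picks the row, P2 picks the column, and each minimizes its own
    cost.  Returns (row, col, (cost1, cost2)) triples."""
    m, n = L1.shape
    eqs = []
    for i, j in product(range(m), range(n)):
        # Neither player can lower its own cost by deviating alone.
        if L1[i, j] == L1[:, j].min() and L2[i, j] == L2[i, :].min():
            eqs.append((i, j, (L1[i, j], L2[i, j])))
    return eqs

def admissible(eqs):
    """Discard equilibria whose cost vector is dominated by that of
    another equilibrium (no worse in both costs, better in at least one)."""
    def dominated(c, d):
        return d[0] <= c[0] and d[1] <= c[1] and d != c
    return [e for e in eqs
            if not any(dominated(e[2], f[2]) for f in eqs)]

# A hypothetical 2 x 2 nonzero-sum game with two equilibria.
L1 = np.array([[1.0, 3.0], [3.0, 2.0]])
L2 = np.array([[2.0, 3.0], [3.0, 1.0]])
eqs = pure_nash(L1, L2)    # equilibria at (0, 0) and (1, 1)
```

In the sequential setting, the admissible cost vectors returned at each vertex, together with their actions, are what would be propagated up to the parent vertex; when the admissible set has more than one element (as here, with costs (1, 2) and (2, 1)), the nonuniqueness discussed below appears.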
The nonuniqueness of Nash equilibria causes the greatest difficulty in the se- quential game setting. There are typically many more equilibria in a sequential
[Figure 10.19: This is a single-stage, zero-sum game that involves nature. It is assumed that all players act at the same time. In the tree, nature acts first (left branch with probability 1/3, right with probability 2/3), followed by P1 and then P2, each choosing L or R; the leaf costs are 3, −2, −6, 3, 3, −1, 6, 0.]
game than in a single-stage game. Therefore, the concept is not very useful in the design of a planning approach. It may be more useful, for example, in modeling the possible outcomes of a complicated economic system. A thorough treatment of the subject appears in [59].
Introducing nature A nature player can easily be introduced into a game. Suppose, for example, that nature is introduced into a zero-sum game. In this case, there are three players: P1, P2, and nature. Figure 10.19 shows a game tree for a single-stage, zero-sum game that involves nature. It is assumed that all three players act at the same time, which fits the stage-by-stage model. Many other information models are possible. Suppose that probabilistic uncertainty is used to model nature, and it is known that nature chooses the left branch with probability 1/3 and the right branch with probability 2/3. Depending on the branch chosen by nature, it appears that P1 and P2 will play a specific 2 × 2 matrix game. With probability 1/3, the cost matrix will be

    [  3  −2 ]
    [ −6   3 ],                                  (10.117)

and with probability 2/3 it will be

    [ 3  −1 ]
    [ 6   0 ].                                   (10.118)
Unfortunately, P1 and P2 do not know which matrix game they are actually playing. The regret can be eliminated in the expected sense, if the game is played over many independent trials. Let A1 and A2 denote (10.117) and (10.118), respectively. Define a new cost matrix as A = (1/3)A1 + (2/3)A2 (multiplying a matrix by a scalar scales every entry of the matrix). The resulting matrix is

    [ 3  −4/3 ]
    [ 2    1  ].                                 (10.119)
This matrix game has a deterministic saddle point in which P1 chooses R (row 2) and P2 chooses L (column 1), which yields a cost of 2. This means that they can play a deterministic strategy to obtain an expected cost of 2, if the game play is averaged over many independent trials. If this matrix did not admit a deterministic saddle point, then a randomized strategy would be needed. It is interesting to note that randomization is not needed for this example, even though P1 and P2 each play against both nature and an intelligent adversary.
Several other variations are possible. If nature is modeled nondeterministically, then a matrix of worst-case regrets can be formed to determine whether it is possible to eliminate regret. A sequential version of games such as the one in Figure 10.19 can be considered. In each stage, there are three substages in which nature, P1, and P2 all act. The bottom-up approach from Section 10.5.1 can be applied to decompose the tree into many single-stage games. Their costs can be propagated upward to the root in the same way to obtain an equilibrium solution. Formulation 10.4 can be easily extended to include nature in games over state spaces. For each x, a nature action set is defined as Θ(x). The state transition
equation is defined as
xk+1 = f(xk, uk, vk, θk), (10.120)
which means that the next state depends on all three player actions, in addition to the current state. The value-iteration method can be extended to solve problems of this type by properly considering the effect of nature in the dynamic programming equations. In the probabilistic case, for example, an expectation over nature is needed in every iteration. The resulting sequential game is often referred to as a Markov game [774].
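One probabilistic backup of this kind can be sketched on a toy Markov game. Everything below (the model, the state and action sets, and the function names) is a hypothetical illustration, not from the text; for simplicity it assumes a deterministic saddle exists at every state, so the value is read directly off the expected cost matrix rather than solved by linear programming.

```python
import numpy as np

def backup(V_next, states, U, W, Theta, f, l):
    """One value-iteration backup for a zero-sum Markov game with a
    probabilistic nature player (P1 minimizes, P2 maximizes).  Assumes
    a deterministic saddle exists at every state; otherwise the matrix
    G below must be solved by the LP of Section 9.3.3."""
    V = {}
    for x in states:
        # Expected cost matrix over (u, v) pairs: stage cost plus
        # cost-to-go, averaged over nature's action theta (10.120).
        G = np.array([[sum(p * (l(x, u, v, th) + V_next[f(x, u, v, th)])
                           for th, p in Theta[x].items())
                       for v in W[x]]
                      for u in U[x]])
        V[x] = G.max(axis=1).min()   # saddle value when one exists
    return V

# Hypothetical two-state toy model.
states = [0, 1]
U = {x: [0, 1] for x in states}                  # P1 actions
W = {x: [0, 1] for x in states}                  # P2 actions
Theta = {x: {0: 0.5, 1: 0.5} for x in states}    # nature's distribution
f = lambda x, u, v, th: (u + v + th) % 2         # state transition
l = lambda x, u, v, th: float(u - v)             # stage cost
V = backup({0: 0.0, 1: 10.0}, states, U, W, Theta, f, l)
```

Each backup builds, at every state, exactly the expected matrix game described in the text; iterating `backup` from the final stage toward the first yields the values of the Markov game under this saddle assumption.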
Introducing more players Involving more players poses no great difficulty, other than complicating the notation. For example, suppose that a set of n players, P1, P2, . . ., Pn, takes turns playing a game. Consider using a game tree representation. A stage is now stretched into n substages, in which each player acts individually. Suppose that P1 always starts, followed by P2, and so on, until Pn. After Pn acts, the next stage is started, and P1 acts. The circular sequence of player alternations continues until the game ends. Again, many different information models are possible. For example, in the stage-by-stage model, each player does not know the action chosen by the other n − 1 players in the current stage. The bottom-up computation method can be used to compute Nash equilibria; however, the problems with nonuniqueness must once again be confronted. A state-space formulation that generalizes Formulation 10.4 can be made by introducing action sets U^i(x) for each player Pi and state x ∈ X. Let u^i_k denote the action chosen by Pi at stage k. The state transition becomes

x_{k+1} = f(x_k, u^1_k, u^2_k, . . . , u^n_k). (10.121)
There is also a cost function, Li, for each Pi. Value iteration, adapted to maintain multiple equilibria and cost vectors, can be used to compute Nash equilibria.
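The equilibrium condition at a single stage of such an n-player game can be sketched as follows. This is a minimal, hypothetical illustration (names are ours): given one n-dimensional cost array per player, a joint action is a deterministic Nash equilibrium when no player can strictly lower its own cost by deviating alone.

```python
import numpy as np
from itertools import product

def pure_nash_n(costs):
    """Deterministic Nash equilibria of an n-player single-stage game.
    costs[i] is an n-dimensional array: costs[i][a1, ..., an] is the
    cost L^i to Pi under joint action (a1, ..., an); each player
    minimizes its own cost."""
    n = len(costs)
    shape = costs[0].shape
    eqs = []
    for a in product(*(range(s) for s in shape)):
        stable = True
        for i in range(n):
            for ai in range(shape[i]):
                b = a[:i] + (ai,) + a[i + 1:]
                if costs[i][b] < costs[i][a]:   # profitable deviation
                    stable = False
                    break
            if not stable:
                break
        if stable:
            eqs.append(a)
    return eqs

# Hypothetical 3-player coordination game: each player's cost is the
# number of other players who chose a different action.
coord = [np.zeros((2, 2, 2)) for _ in range(3)]
for a in product(range(2), repeat=3):
    for i in range(3):
        coord[i][a] = sum(aj != a[i] for aj in a)
eqs = pure_nash_n(coord)   # the two unanimous profiles
```

As in the two-player case, value iteration over (10.121) would apply this check (or its randomized analogue) at every state and stage, propagating the admissible cost vectors upward.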