9.3 Two-Player Zero-Sum Games

Section 9.2 involved one real decision maker (DM), the robot, playing against a fictitious DM called nature. Now suppose that the second DM is a clever opponent that makes decisions in the same way that the robot would. This leads to a symmetric situation in which two decision makers simultaneously make a decision, without knowing how the other will act. It is assumed in this section that the DMs have diametrically opposing interests. They are two players engaged in a game in which a loss for one player is a gain for the other, and vice versa. This results in the most basic form of game theory, which is referred to as a zero-sum game.

9.3.1 Game Formulation

Suppose there are two players, P1 and P2, that each have to make a decision. Each has a finite set of actions, U and V, respectively. The set V can be viewed as the “replacement” of Θ from Formulation 9.3 by a set of actions chosen by a true opponent. Each player has a cost function, which is denoted as Li : U × V → R for i = 1, 2. An important constraint for zero-sum games is

    L1(u, v) = −L2(u, v),                              (9.41)

which means that a cost for one player is a reward for the other. This is the basis of the term zero sum, which means that the two costs can be added to obtain zero. In zero-sum games the interests of the players are completely opposed. In Section 9.4 this constraint will be lifted to obtain more general games.

In light of (9.41) it is pointless to represent two cost functions. Instead, the superscript will be dropped, and L will refer to the cost, L1, of P1. The goal of P1 is to minimize L. Due to (9.41), the goal of P2 is to maximize L. Thus, L can be considered as a reward for P2, but a cost for P1.

A formulation can now be given:

Formulation 9.7 (A Zero-Sum Game)

    1. Two players, P1 and P2.
    2. A nonempty, finite set U called the action space for P1. For convenience in describing examples, assume that U is a set of consecutive integers from 1 to |U|. Each u ∈ U is referred to as an action of P1.
    3. A nonempty, finite set V called the action space for P2. Assume that V is a set of consecutive integers from 1 to |V|. Each v ∈ V is referred to as an action of P2.
    4. A function L : U × V → R ∪ {−∞, ∞} called the cost function for P1. This also serves as a reward function for P2 because of (9.41).

Before discussing what it means to solve a zero-sum game, some additional assumptions are needed. Assume that the players know each other’s cost functions. This implies that the motivation of the opponent is completely understood. The other assumption is that the players are rational, which means that they will try to obtain the best cost whenever possible. P1 will not choose an action that leads to higher cost when a lower-cost action is available. Likewise, P2 will not choose an action that leads to lower cost when a higher-cost action is available. Finally, it is assumed that both players make their decisions simultaneously. There is no information regarding the decision of P1 that can be exploited by P2, and vice versa.

Formulation 9.7 is often referred to as a matrix game because L can be expressed with a cost matrix, as was done in Section 9.2. Here the matrix indicates costs for P1 and P2, instead of the robot and nature. All of the required information from Formulation 9.7 is specified by a single matrix; therefore, it is a convenient form for expressing zero-sum games.

Example 9.13 (Matrix Representation of a Zero-Sum Game) Suppose that U, the action set for P1, contains three actions and V contains four actions. There should be 3 × 4 = 12 values in the specification of the cost function, L. This can be expressed as a cost matrix,

                 V
             1   3   3   2
        U    0  -1   2   1 ,                           (9.42)
            -2   2   0   1

in which each row corresponds to some u ∈ U, and each column corresponds to some v ∈ V. Each entry yields L(u, v), which is the cost for P1. This representation is similar to that shown in Example 9.8, except that the nature action space, Θ, is replaced by V. The cost for P2 is −L(u, v). ■
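The matrix representation translates directly into code. Here is a minimal Python sketch (mine, not from the book; the names A and L simply follow the text), which stores (9.42) as a nested list and evaluates the cost function using the 1-based action indices of Formulation 9.7:

```python
# Cost matrix from (9.42): rows are actions of P1, columns are actions of P2.
A = [
    [ 1,  3,  3,  2],
    [ 0, -1,  2,  1],
    [-2,  2,  0,  1],
]

def L(u, v):
    """Cost L(u, v) for P1; actions are 1-based, as in Formulation 9.7."""
    return A[u - 1][v - 1]

# By the zero-sum constraint (9.41), the cost for P2 is -L(u, v).
print(L(2, 3), -L(2, 3))  # 2 -2
```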

9.3.2 Deterministic Strategies

What constitutes a good solution to Formulation 9.7? Consider the game from the perspective of P1. It seems reasonable to apply worst-case analysis when trying to account for the action that will be taken by P2. This results in a choice that is equivalent to assuming that P2 is nature acting under the nondeterministic model, as considered in Section 9.2.2. For a matrix game, this is computed by first determining the maximum cost over each row. Selecting the action that produces the minimum among these represents the lowest cost that P1 can guarantee for itself. Let this selection be referred to as a security strategy for P1.

For the matrix game in (9.42), the security strategy is illustrated as

                 V
             1   3   3   2   → 3
        U    0  -1   2   1   → 2 ,                     (9.43)
            -2   2   0   1   → 2

in which u = 2 and u = 3 are the best actions. Each yields a cost no worse than 2, regardless of the action chosen by P2.
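The row-maximum computation just described can be sketched in a few lines of Python (an illustration of mine, not from the book; the function name is assumed):

```python
# Cost matrix from (9.42); rows are actions of P1, columns are actions of P2.
A = [
    [ 1,  3,  3,  2],
    [ 0, -1,  2,  1],
    [-2,  2,  0,  1],
]

def security_strategy_P1(A):
    """Worst-case cost over each row, then the action(s) minimizing it."""
    worst = [max(row) for row in A]          # max over v for each fixed u
    upper_value = min(worst)                 # lowest cost P1 can guarantee
    best = [i + 1 for i, c in enumerate(worst) if c == upper_value]
    return best, upper_value

print(security_strategy_P1(A))  # ([2, 3], 2)
```

As in the text, u = 2 and u = 3 each guarantee a cost no worse than 2.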

This can be formalized using the existing notation. A security strategy, u∗, for P1 is defined in general as

    u∗ = argmin_{u∈U} { max_{v∈V} L(u, v) }.           (9.44)

There may be multiple security strategies that satisfy the argmin; however, this does not cause trouble, as will be explained shortly. Let the resulting worst-case cost be denoted by L̄∗, and let it be called the upper value of the game. This is defined as

    L̄∗ = max_{v∈V} { L(u∗, v) }.                       (9.45)

Now swap roles, and consider the game from the perspective of P2, which would like to maximize L. It can also use worst-case analysis, which means that it would like to select an action that guarantees a high cost, in spite of the action of P1 to potentially reduce it. A security strategy, v∗, for P2 is defined as

    v∗ = argmax_{v∈V} { min_{u∈U} L(u, v) }.           (9.46)

Note the symmetry with respect to (9.44). There may be multiple security strategies for P2. A security strategy v∗ is just an “upside-down” version of the worst-case analysis applied in Section 9.2.2. The lower value, L̲∗, is defined as

    L̲∗ = min_{u∈U} { L(u, v∗) }.                       (9.47)

Returning to the matrix game in (9.42), the last column is selected by applying (9.46):

                 V
             1   3   3   2
        U    0  -1   2   1 ,                           (9.48)
            -2   2   0   1
             ↓   ↓   ↓   ↓
            -2  -1   0   1

in which the minimum of each column appears below it. The largest of these is 1; hence, v∗ = 4 and the lower value is L̲∗ = 1.
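P2’s side of the computation (column minima, then the maximum among them) can be sketched similarly (again my illustration; the function name is assumed):

```python
# Cost matrix from (9.42).
A = [
    [ 1,  3,  3,  2],
    [ 0, -1,  2,  1],
    [-2,  2,  0,  1],
]

def security_strategy_P2(A):
    """Worst-case (for P2) cost over each column, then the action(s) maximizing it."""
    n = len(A[0])
    worst = [min(row[j] for row in A) for j in range(n)]  # min over u for each v
    lower_value = max(worst)                              # highest cost P2 can guarantee
    best = [j + 1 for j, c in enumerate(worst) if c == lower_value]
    return best, lower_value

print(security_strategy_P2(A))  # ([4], 1)
```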

An interesting relationship between the upper and lower values is that L̲∗ ≤ L̄∗ for any game using Formulation 9.7. This is shown by observing that

    L̲∗ = min_{u∈U} { L(u, v∗) } ≤ L(u∗, v∗) ≤ max_{v∈V} { L(u∗, v) } = L̄∗,    (9.49)

in which L(u∗, v∗) is the cost received when the players apply their respective security strategies. If the game is played by rational DMs, then the resulting cost always lies between L̲∗ and L̄∗.
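The chain of inequalities in (9.49) can be checked numerically for the matrix in (9.42); the following is a quick sketch of mine (variable names are not from the book):

```python
# Cost matrix from (9.42).
A = [
    [ 1,  3,  3,  2],
    [ 0, -1,  2,  1],
    [-2,  2,  0,  1],
]

row_worst = [max(row) for row in A]
col_worst = [min(row[j] for row in A) for j in range(len(A[0]))]
upper = min(row_worst)            # upper value
lower = max(col_worst)            # lower value
u_star = row_worst.index(upper)   # a security strategy for P1 (0-based index)
v_star = col_worst.index(lower)   # a security strategy for P2 (0-based index)
cost = A[u_star][v_star]          # cost when both play security strategies

assert lower <= cost <= upper     # the relationship shown in (9.49)
print(lower, cost, upper)  # 1 1 2
```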

Regret Suppose that the players apply security strategies, u∗ = 2 and v∗ = 4. This results in a cost of L(2, 4) = 1. How do the players feel after the outcome? P1 may feel satisfied because given that P2 selected v∗ = 4, it received the lowest cost possible. On the other hand, P2 may regret its decision in light of the action chosen by P1. If it had known that u = 2 would be chosen, then it could have picked v = 3 to receive cost L(2, 3) = 2, which is better than L(2, 4) = 1. If the game were to be repeated, then P2 would want to change its strategy in hopes of tricking P1 to obtain a higher reward.

Is there a way to keep both players satisfied? Any time there is a gap between L̲∗ and L̄∗, there is regret for one or both players. If r1 and r2 denote the amount of regret experienced by P1 and P2, respectively, then the total regret is

    r1 + r2 = L̄∗ − L̲∗.                                (9.50)

Thus, the only way to satisfy both players is to obtain upper and lower values such that L̄∗ = L̲∗. These are properties of the game, however, and they are not up to the players to decide. For some games, the values are equal, but for many games L̲∗ < L̄∗. Fortunately, by using randomized strategies, the upper and lower values always coincide; this is covered in Section 9.3.3.

Saddle points If L̄∗ = L̲∗, the security strategies are called a saddle point, and L∗ = L̄∗ = L̲∗ is called the value of the game. If this occurs, the order of the max and min can be swapped without changing the value:

    L∗ = min_{u∈U} { max_{v∈V} L(u, v) } = max_{v∈V} { min_{u∈U} L(u, v) }.    (9.51)

A saddle point is sometimes referred to as an equilibrium because the players have no incentive to change their choices (because there is no regret). A saddle point is defined as any u∗ ∈ U and v∗ ∈ V such that

    L(u∗, v) ≤ L(u∗, v∗) ≤ L(u, v∗)                    (9.52)

for all u ∈ U and v ∈ V. Note that L∗ = L(u∗, v∗). When looking at a matrix game, a saddle point is found by finding the simple pattern shown in Figure 9.2.

Figure 9.2: A saddle point can be detected in a matrix by finding a value L∗ that is lowest among all elements in its column and greatest among all elements in its row.

Example 9.14 (A Deterministic Saddle Point) Here is a matrix game that has a saddle point:

                 V
             3   3   5
        U    1  -1   7 .                               (9.53)
             0  -2   4

By applying (9.52) (or using Figure 9.2), the saddle point is obtained when u = 3 and v = 3. The result is that L∗ = 4. In this case, neither player has regret after the game is finished. P1 is satisfied because 4 is the lowest cost it could have received, given that P2 chose the third column. Likewise, 4 is the highest cost that P2 could have received, given that P1 chose the bottom row. ■
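The pattern of Figure 9.2 is easy to search for directly. The following sketch (mine; the helper name is assumed) scans a matrix for entries that are lowest in their column and greatest in their row, and it recovers the saddle point of (9.53):

```python
def saddle_points(A):
    """Return all 1-based (u, v) pairs satisfying the pattern of Figure 9.2."""
    points = []
    for i, row in enumerate(A):
        for j, x in enumerate(row):
            column = [r[j] for r in A]
            # Lowest in its column and greatest in its row: a saddle point.
            if x == min(column) and x == max(row):
                points.append((i + 1, j + 1))
    return points

# Matrix from (9.53).
B = [
    [3,  3, 5],
    [1, -1, 7],
    [0, -2, 4],
]
print(saddle_points(B))  # [(3, 3)]
```

The single saddle point (3, 3) carries the value L∗ = 4, matching the example.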

What if there are multiple saddle points in the same game? This may appear to be a problem because the players have no way to coordinate their decisions. What if P1 tries to achieve one saddle point while P2 tries to achieve another? It turns out that if there is more than one saddle point, then there must be at least four, as shown in Figure 9.3. As soon as we try to make two “+” patterns like the one shown in Figure 9.2, they intersect, and four saddle points are created. Similar behavior occurs as more saddle points are added.

Figure 9.3: A matrix could have more than one saddle point, which may seem to lead to a coordination problem between the players. Fortunately, there is no problem, because the same value will be received regardless of which saddle point is selected by each player.

Example 9.15 (Multiple Saddle Points) This game has multiple saddle points and follows the pattern in Figure 9.3:

                 V
             4   3   5   1   2
            -1   0  -2   0  -1
        U   -4   1   4   3   5 .                       (9.54)
            -3   0  -1   0  -2
             3   2  -7   3   8

Let (i, j) denote the pair of choices for P1 and P2, respectively. Both (2, 2) and (4, 4) are saddle points with value L∗ = 0. What if P1 chooses u = 2 and P2 chooses v = 4? This is not a problem because (2, 4) is also a saddle point. Likewise, (4, 2) is another saddle point. In general, no problems are caused by the existence of multiple saddle points because the resulting cost is independent of which saddle point is attempted by each player. ■
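Running the same Figure 9.2 pattern search over (9.54) exhibits the four-saddle-point structure described above (my sketch; the helper name is assumed):

```python
def saddle_points(A):
    """Return all 1-based (u, v) pairs that are lowest in their column
    and greatest in their row (the pattern of Figure 9.2)."""
    points = []
    for i, row in enumerate(A):
        for j, x in enumerate(row):
            column = [r[j] for r in A]
            if x == min(column) and x == max(row):
                points.append((i + 1, j + 1))
    return points

# Matrix from (9.54).
C = [
    [ 4, 3,  5, 1,  2],
    [-1, 0, -2, 0, -1],
    [-4, 1,  4, 3,  5],
    [-3, 0, -1, 0, -2],
    [ 3, 2, -7, 3,  8],
]
# The saddle points form the corners of a rectangle, all with the same value 0.
print(saddle_points(C))  # [(2, 2), (2, 4), (4, 2), (4, 4)]
```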

9.3.3 Randomized Strategies

The fact that some zero-sum games do not have a saddle point is disappointing because regret is unavoidable in these cases. Suppose we slightly change the rules. Assume that the same game is repeatedly played by P1 and P2 over numerous trials. If they use a deterministic strategy, they will choose the same actions every time, resulting in the same costs. They may instead switch between alternative security strategies, which causes fluctuations in the costs. What happens if they each implement a randomized strategy? Using the idea from Section 9.1.3, each strategy is specified as a probability distribution over the actions. In the limit, as the number of times the game is played tends to infinity, an expected cost is obtained. One of the most famous results in game theory is that on the space of randomized strategies, a saddle point always exists for any zero-sum matrix game; however, expected costs must be used. Thus, if randomization is used, there will be no regrets. In an individual trial, regret may be possible; however, as the costs are averaged over all trials, both players will be satisfied.

Extending the formulation

Since a game under Formulation 9.7 can be nicely expressed as a matrix, it is tempting to use linear algebra to conveniently express expected costs. Let |U| = m and |V| = n. As in Section 9.1.3, a randomized strategy for P1 can be represented as an m-dimensional vector,

    w = [w1 w2 . . . wm].                              (9.55)

The probability axioms of Section 9.1.2 must be satisfied: 1) wi ≥ 0 for all i ∈ {1, . . . , m}, and 2) w1 + · · · + wm = 1. If w is considered as a point in R^m, then the two constraints imply that it must lie on an (m − 1)-dimensional simplex (recall Section 6.3.1). If m = 3, this means that w lies in a triangular subset of R^3. Similarly, let z represent a randomized strategy for P2 as an n-dimensional vector,

    z = [z1 z2 . . . zn]^T,                            (9.56)

that also satisfies the probability axioms. In (9.56), T denotes transpose, which yields a column vector that satisfies the dimensional constraints required for an upcoming matrix multiplication.

Let L̄(w, z) denote the expected cost that will be received if P1 plays w and P2 plays z. This can be computed as

    L̄(w, z) = Σ_{i=1}^{m} Σ_{j=1}^{n} L(i, j) wi zj.   (9.57)

Note that the cost, L(i, j), makes use of the assumption in Formulation 9.7 that the actions are consecutive integers. The expected cost can be alternatively expressed using the cost matrix, A. In this case

    L̄(w, z) = wAz,                                     (9.58)

in which the product wAz yields a scalar value that is precisely (9.57). To see this, first consider the product Az. This yields an m-dimensional vector in which the ith element is the expected cost that P1 would receive if it tries u = i. Thus, it appears that P1 views P2 as a nature player under the probabilistic model. Once w and Az are multiplied, a scalar value is obtained, which averages the costs in the vector Az according to the probabilities of w.
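The equality of (9.57) and (9.58) can be verified numerically. This sketch (mine; it uses exact rational arithmetic via the fractions module to avoid floating-point noise, and the particular w and z are arbitrary choices) computes both forms for the matrix of (9.42):

```python
from fractions import Fraction as F

# Cost matrix from (9.42).
A = [
    [ 1,  3,  3,  2],
    [ 0, -1,  2,  1],
    [-2,  2,  0,  1],
]
w = [F(1, 2), F(1, 4), F(1, 4)]           # a randomized strategy for P1
z = [F(1, 4), F(1, 4), F(1, 4), F(1, 4)]  # a randomized strategy for P2

# Double-sum form (9.57).
Lbar = sum(A[i][j] * w[i] * z[j] for i in range(3) for j in range(4))

# Matrix form (9.58): compute Az first, then the inner product with w.
Az = [sum(A[i][j] * z[j] for j in range(4)) for i in range(3)]
wAz = sum(w[i] * Az[i] for i in range(3))

assert Lbar == wAz   # the two forms agree exactly
print(Lbar)  # 21/16
```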

Let W and Z denote the set of all randomized strategies for P1 and P2, respectively. These spaces include strategies that are equivalent to the deterministic strategies considered in Section 9.3.2 by assigning probability one to a single action. Thus, W and Z can be considered as expansions of the set of possible strategies in comparison to what was available in the deterministic setting. Using W and Z, randomized security strategies for P1 and P2 are defined as

    w∗ = argmin_{w∈W} { max_{z∈Z} L̄(w, z) }           (9.59)

and

    z∗ = argmax_{z∈Z} { min_{w∈W} L̄(w, z) },          (9.60)

respectively. These should be compared to (9.44) and (9.46). The differences are that the space of strategies has been expanded, and expected cost is now used.

The randomized upper value is defined as

    ℒ̄∗ = max_{z∈Z} { L̄(w∗, z) },                      (9.61)

and the randomized lower value is

    ℒ̲∗ = min_{w∈W} { L̄(w, z∗) }.                      (9.62)

Since W and Z include the deterministic security strategies, ℒ̄∗ ≤ L̄∗ and ℒ̲∗ ≥ L̲∗. These inequalities imply that the randomized security strategies may have some hope in closing the gap between the two values in general.

The most fundamental result in zero-sum game theory was shown by von Neumann [956, 957], and it states that ℒ̄∗ = ℒ̲∗ for any game in Formulation 9.7. This yields the randomized value ℒ∗ = ℒ̄∗ = ℒ̲∗ for the game. This means that there will never be expected regret if the players stay with their security strategies. If the players apply their randomized security strategies, then a randomized saddle point is obtained. This saddle point cannot be seen as a simple pattern in the matrix A because it instead exists over W and Z.

The guaranteed existence of a randomized saddle point is an important result because it demonstrates the value of randomization when making decisions against an intelligent opponent. In Example 9.7, it was intuitively argued that randomization seems to help when playing against an intelligent adversary. When playing the game repeatedly with a deterministic strategy, the other player could learn the strategy and win every time. Once a randomized strategy is used, the players will not experience regret.

Computation of randomized saddle points

So far it has been established that a randomized saddle point always exists, but how can one be found? Two key observations enable a combinatorial solution to the problem:

    1. The security strategy for each player can be found by considering only deterministic strategies for the opposing player.
    2. If the strategy for the other player is fixed, then the expected cost is a linear function of the undetermined probabilities.

First consider the problem of determining the security strategy for P1. The first observation means that (9.59) does not need to consider randomized strategies for P2. Inside of the argmin, w is fixed. What randomized strategy, z ∈ Z, maximizes L̄(w, z) = wAz? If w is fixed, then wA can be treated as a constant n-dimensional vector, s. This means L̄(w, z) = s · z, in which · is the inner (dot) product. Now the task is to select z to maximize s · z. This involves selecting the largest element of s; suppose this is si. The maximum cost over all z ∈ Z is obtained by placing all of the probability mass at action i. Thus, the strategy zi = 1 and zj = 0 for j ≠ i gives the highest cost, and it is deterministic.

Using the first observation, for each w ∈ W, only n possible responses by P2 need to be considered. These are the n deterministic strategies, each of which assigns zi = 1 for a unique i ∈ {1, . . . , n}.
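The first observation can be demonstrated concretely: fix any w, form s = wA, and the best response for P2 is deterministic. The following sketch is mine (the uniform w is an arbitrary choice, and exact arithmetic is used):

```python
from fractions import Fraction as F

# Cost matrix from (9.42).
A = [
    [ 1,  3,  3,  2],
    [ 0, -1,  2,  1],
    [-2,  2,  0,  1],
]
w = [F(1, 3)] * 3                         # some fixed strategy for P1

s = [sum(w[i] * A[i][j] for i in range(3)) for j in range(4)]  # s = wA
best_j = max(range(4), key=lambda j: s[j])

# Placing all probability mass on the largest element of s attains max(s) ...
z_det = [F(1) if j == best_j else F(0) for j in range(4)]
assert sum(s[j] * z_det[j] for j in range(4)) == max(s)

# ... while any other distribution only averages s, never exceeding max(s).
z_mixed = [F(1, 4)] * 4
assert sum(s[j] * z_mixed[j] for j in range(4)) <= max(s)

print(best_j + 1, max(s))  # 3 5/3
```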

Now consider the second observation. The expected cost, L̄(w, z) = wAz, is a linear function of w, if z is fixed. Since z only needs to be fixed at n different values due to the first observation, w is selected at the point at which the smallest maximum value among the n linear functions occurs. This is the minimum value of the upper envelope of the collection of linear functions. Such envelopes were mentioned in Section 6.5.2. Example 9.16 will illustrate this. The domain for this optimization can conveniently be set as a triangle in R^(m−1). Even though W ⊂ R^m, the last coordinate, wm, is not needed because it is always wm = 1 − (w1 + · · · + wm−1). The resulting optimization falls under linear programming, for which many combinatorial algorithms exist [259, 264, 664, 731].
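For the special case m = 2, the minimization over the upper envelope can be carried out combinatorially: the envelope is convex and piecewise linear, so its minimum over [0, 1] lies either at an endpoint or at an intersection of two of the n lines, and checking those candidates suffices. The following sketch is mine, not the book’s algorithm (the function name is assumed; exact arithmetic is used):

```python
from fractions import Fraction as F

def p1_security_2xn(A):
    """Randomized security strategy for P1 in a 2-row matrix game.

    Column j of A gives the line f_j(w1) = A[0][j]*w1 + A[1][j]*(1 - w1).
    The minimum of the (convex) upper envelope occurs at w1 = 0, w1 = 1, or
    at a pairwise intersection of lines, so only those points are tested.
    """
    n = len(A[0])

    def f(j, w1):
        return A[0][j] * w1 + A[1][j] * (1 - w1)

    candidates = [F(0), F(1)]
    for j in range(n):
        for k in range(j + 1, n):
            slope_j, slope_k = A[0][j] - A[1][j], A[0][k] - A[1][k]
            if slope_j != slope_k:        # parallel lines never intersect
                w1 = F(A[1][k] - A[1][j], slope_j - slope_k)
                if 0 <= w1 <= 1:
                    candidates.append(w1)

    best = min(candidates, key=lambda w1: max(f(j, w1) for j in range(n)))
    value = max(f(j, best) for j in range(n))
    return [best, 1 - best], value

w_star, value = p1_security_2xn([[3, 0], [-1, 1]])
print(w_star, value)  # [Fraction(2, 5), Fraction(3, 5)] 3/5
```

Applied to the matrix of (9.63), this reproduces the strategy derived graphically in Example 9.16. For general m, the same idea falls under linear programming, as noted above.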

In the explanation above, there is nothing particular to P1 when trying to find its security strategy. The same method can be applied to determine the security strategy for P2; however, every minimization is replaced by a maximization, and vice versa. In summary, the min in (9.60) needs only to consider the deterministic strategies in W. If w becomes fixed, then L̄(w, z) = wAz is once again a linear function, but this time it is linear in z. The best randomized action is chosen by finding the point z ∈ Z that gives the highest minimum value among m linear functions. This is the maximum value of the lower envelope of the collection of linear functions. The optimization occurs over R^(n−1) because the last coordinate, zn, is obtained directly from zn = 1 − (z1 + · · · + zn−1).

This computation method is best understood through an example.

Example 9.16 (Computing a Randomized Saddle Point) The simplest case is when both players have only two actions. Let the cost matrix be defined as

                 V
        U    3   0 .                                   (9.63)
            -1   1

Consider computing the security strategy for P1. Note that W and Z are only one-dimensional subsets of R^2. A randomized strategy for P1 is w = [w1 w2], with w1 ≥ 0, w2 ≥ 0, and w1 + w2 = 1. Therefore, the domain over which the optimization is performed is w1 ∈ [0, 1] because w2 can always be derived as w2 = 1 − w1. Using the first observation above, only the two deterministic strategies for P2 need to be considered. When considered as linear functions of w, these are

    (3)w1 + (−1)(1 − w1) = 4w1 − 1                     (9.64)

for z1 = 1, and

    (0)w1 + (1)(1 − w1) = 1 − w1                       (9.65)

for z2 = 1. The lines are plotted in Figure 9.4a. The security strategy is determined by the minimum point along the upper envelope shown in the figure. This is indicated by the thickened line, and it is always a piecewise-linear function in general. The lowest point occurs at w1 = 2/5, and the resulting value is ℒ∗ = 3/5. Therefore, w∗ = [2/5 3/5].


Figure 9.4: (a) Computing the randomized security strategy, w∗, for P1. (b) Computing the randomized security strategy, z∗, for P2.

A similar procedure can be used to obtain z∗. The lines that correspond to the deterministic strategies of P1 are shown in Figure 9.4b. The security strategy is obtained by finding the maximum value along the lower envelope of the lines, which is shown as the thickened line in the figure. This results in z∗ = [1/5 4/5]^T, and once again, the value is observed as ℒ∗ = 3/5 (this must coincide with the previous one because the randomized upper and lower values are the same!). ■
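The two envelope computations of this example can be confirmed with exact arithmetic (a quick check of my own, not from the book):

```python
from fractions import Fraction as F

# P1's side: upper envelope of the lines (9.64) and (9.65).  They cross
# where 4*w1 - 1 = 1 - w1, i.e. at w1 = 2/5, giving the value 3/5.
w1 = F(2, 5)
assert 4 * w1 - 1 == 1 - w1 == F(3, 5)

# P2's side: lower envelope of 3*z1 (row u = 1) and 1 - 2*z1 (row u = 2).
# They cross where 3*z1 = 1 - 2*z1, i.e. at z1 = 1/5, giving the same value.
z1 = F(1, 5)
assert 3 * z1 == 1 - 2 * z1 == F(3, 5)

print(w1, z1)  # 2/5 1/5
```

Both envelopes meet at the randomized value 3/5, as the example requires.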

This procedure appears quite simple if there are only two actions per player. If n = m = 100, then the upper and lower envelopes are piecewise-linear functions in R^99. This may be computationally impractical because the combinatorial linear programming algorithms cited above have running time at least exponential in the dimension [264].
