9.5 Decision Theory Under Scrutiny

Numerous models for decision making were introduced in this chapter. These provide a foundation for planning under uncertainty, which is the main focus of Part III. Before constructing planning models with this foundation, it is important to critically assess how appropriate it may be in applications. You may have had many questions while reading Sections 9.1 to 9.4. How are the costs determined? Why should we believe that optimizing the expected cost is the right thing to do? What happens if prior probability distributions are not available? Is worst-case analysis too conservative? Can we be sure that players in a game will follow the assumed rational behavior? Is it realistic that players know each other's cost functions? The purpose of this section is to help shed some light on these questions. A building is only as good as its foundation. Any mistakes made by misunderstanding the limitations of decision theory will ultimately work their way into planning formulations that are constructed from them.

9.5.1 Utility Theory and Rationality

This section provides some justification for using cost functions and then minimizing their expected value under Formulations 9.3 and 9.4. The resulting framework is called utility theory, which is usually formulated using rewards instead of costs. As stated in Section 9.1.1, a cost can be converted into a reward by multiplying by −1 and then swapping each maximization with minimization. We will therefore talk about a reward R with the intuition that a higher reward is better.

Comparing rewards

Imagine assigning reward values to various outcomes of a decision-making process. In some applications numerical values may come naturally. For example, the reward might be the amount of money earned in a financial investment. In robotics applications, one could negate time to execute a task or the amount of energy consumed. For example, the reward could indicate the amount of remaining battery life after a mobile robot builds a map.

In some applications the source of rewards may be subjective. For example, what is the reward for washing dishes, in comparison to sweeping the floor? Each person would probably assign different rewards, which may even vary from day to day. It may be based on their enjoyment or misery in performing the task, the amount of time each task would take, the perceptions of others, and so on. If decision theory is used to automate the decision process for a human “client,” then it is best to consult carefully with the client to make sure you know their preferences. In this situation, it may be possible to sort their preferences and then assign rewards that are consistent with the ordering.

Once the rewards are assigned, consider making a decision under Formulation 9.1, which does not involve nature. Each outcome corresponds directly to an action, u ∈ U. If the rewards are given by R : U → R, then the cost, L, can be defined as L(u) = −R(u) for every u ∈ U. Satisfying the client is then a matter of choosing u to minimize L.

Now consider a game against nature. The decision now involves comparing probability distributions over the outcomes. The space of all probability distributions may be enormous, but this is simplified by using expectation to map each probability distribution (or density) to a real value. The concern should be whether this projection of distributions onto real numbers will fail to reflect the true preferences of the client. The following example illustrates the effect of this.

Example 9.22 (Do You Like to Gamble?) Suppose you are given three choices:

          1. You can have 1000 Euros.
          2. We will toss an unbiased coin, and if the result is heads, then you will receive 2000 Euros. Otherwise, you receive nothing.
          3. With probability 2/3, you can have 3000 Euros; however, with probability 1/3, you have to give me 3000 Euros.

The expected reward for each of these choices is 1000 Euros, but would you really consider these to be equivalent? Your love or disdain for gambling is not being taken into account by the expectation. How should such an issue be considered in games against nature? ■
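The gap between expectation and risk can be seen numerically. The following Python sketch (an illustration added here, not part of the original text) computes the expected reward and the standard deviation of each of the three choices; expectation alone assigns all three the same value of 1000 Euros.

```python
# A rough sketch: the three gambles of Example 9.22 have identical expected
# rewards but very different risk profiles.
gambles = {
    "sure 1000 Euros":      [(1.0, 1000)],
    "coin flip, 0 or 2000": [(0.5, 0), (0.5, 2000)],
    "win or lose 3000":     [(2 / 3, 3000), (1 / 3, -3000)],
}

for name, outcomes in gambles.items():
    mean = sum(p * r for p, r in outcomes)
    std = sum(p * (r - mean) ** 2 for p, r in outcomes) ** 0.5
    print(f"{name:22s} expected reward = {mean:7.1f}, std. dev. = {std:7.1f}")
```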

To begin to fix this problem, it is helpful to consider another scenario. Many people would probably agree that having more money is preferable (if having too much worries you, then you can always give away the surplus to your favorite charities). What is interesting, however, is that being wealthy decreases the perceived value of money. This is illustrated in the next example.

Example 9.23 (Reality Television) Suppose you are lucky enough to appear on a popular reality television program. The point of the show is to test how far you will go in making a fool out of yourself, or perhaps even torturing yourself, to earn some money. You are asked to do some unpleasant task (such as eating cockroaches, holding your head under water for a long time, and so on). Let u1 be the action to agree to do the task, and let u2 mean that you decline the opportunity. The prizes are expressed in U.S. dollars. Imagine that you are a starving student on a tight budget.

Below are several possible scenarios that could be presented on the television program. Consider how you would react to each one.

  1. Suppose that u1 earns you $1 and u2 earns you nothing. Purely optimizing the reward would lead to choosing u1, which means performing the unpleasant task. However, is this worth $1? The problem so far is that we are not taking into account the amount of discomfort in completing a task. Perhaps it might make sense to make a reward function that shifts the dollar values by subtracting the amount for which you would be just barely willing to perform the task.

  2. Suppose that u1 earns you $10,000 and u2 earns you nothing. $10,000 is assumed to be an enormous amount of money, clearly worth enduring any torture inflicted by the television program. Thus, u1 is preferable.
  3. Now imagine that the television host first gives you $10 million just for appearing on the program. Are you still willing to perform the unpleasant task for an extra $10,000? Probably not. What is happening here? Your sense of value assigned to money seems to decrease as you get more of it, right? It would not be too interesting to watch the program if the contestants were all wealthy oil executives.
  4. Suppose that you have performed the task and are about to win the prize. Just to add to the drama, the host offers you a gambling opportunity. You can select action u1 and receive $10,000, or be a gambler by selecting u2 and have probability 1/2 of winning $25,000 by the tossing of a fair coin. In terms of the expected reward, the clear choice is u2. However, you just completed the unpleasant task and expect to earn money. The risk of losing it all may be intolerable. Different people will have different preferences in this situation.
  5. Now suppose once again that you performed the task. This time your choices are u1, to receive $100, or u2, to have probability 1/2 of receiving $250 by tossing a fair coin. The host is kind enough, though, to let you play 100 times. In this case, the expected totals for the two actions are $10,000 and $12,500, respectively. This time it seems clear that the best choice is to gamble. After 100 independent trials, we would expect that, with extremely high probability, over $10,000 would be earned (as the simulation sketched below suggests). Thus, reasoning by expected-case analysis seems valid if we are allowed numerous, independent trials. In this case, with high probability a value close to the expected reward should be received.
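A short simulation makes the repeated-trials argument concrete. The sketch below (added for illustration; the random seed and repetition counts are arbitrary) repeats the 100-round gamble many times and estimates how often the gambler beats the sure $10,000 total.

```python
import numpy as np

# Sketch of scenario 5: gamble 100 times on a fair coin that pays $250 or
# nothing each round, and compare the total against the sure $10,000.
rng = np.random.default_rng(0)
num_repeats, num_rounds = 10_000, 100
totals = (rng.random((num_repeats, num_rounds)) < 0.5).sum(axis=1) * 250

print("average total:", totals.mean())                       # close to $12,500
print("fraction of runs above $10,000:", (totals > 10_000).mean())
```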

Based on these examples, it seems that the client or evaluator of the decision-making system must indicate preferences between probability distributions over outcomes. There is a formal way to ensure that once these preferences are assigned, a cost function can be designed for which its expectation faithfully reflects the preferences over distributions. This results in utility theory, which involves the following steps:

  1. Require that the client is rational when assigning preferences. This notion is defined through axioms.
  2. If the preferences are assigned in a way that is consistent with the axioms, then a utility function is guaranteed to exist. When expected utility is optimized, the preferences match exactly those of the client.
  3. The cost function can be derived from the utility function.

The client must specify preferences among probability distributions of outcomes. Suppose that Formulation 9.2 is used. For convenience, assume that U and Θ are finite. Let X denote a state space based on outcomes.5 Let f : U × Θ → X denote a mapping that assigns a state to every outcome. A simple example is to declare that X = U × Θ and make f the identity map. This makes the outcome space and state space coincide. It may be convenient, though, to use f to collapse the space of outcomes down to a smaller set. If two outcomes map to the same state using f, then it means that the outcomes are indistinguishable as far as rewards or costs are concerned.

5In most utility theory literature, this is referred to as a reward space, R [89].

Let z denote a probability distribution over X, and let Z denote the set of all probability distributions over X. Every z ∈ Z is represented as an n-dimensional vector of probabilities in which n = |X|; hence, it is considered as an element of Rn.

This makes it convenient to "blend" two probability distributions. For example, let α ∈ (0, 1) be a constant, and let z1 and z2 be any two probability distributions. Using scalar multiplication, a new probability distribution, αz1 + (1 − α)z2, is obtained, which is a blend of z1 and z2. Conveniently, there is no need to normalize the result. It is assumed that z1 and z2 initially have unit magnitude. The blend has magnitude α + (1 − α) = 1.
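As a small illustration (not part of the original text), the blending operation can be checked directly on probability vectors:

```python
import numpy as np

# Sketch: blending two distributions over a three-element state space.
# Any convex combination of probability vectors is again a probability vector,
# so no renormalization is needed.
z1 = np.array([0.5, 0.25, 0.25])
z2 = np.array([1.0 / 3, 1.0 / 3, 1.0 / 3])
alpha = 0.4

blend = alpha * z1 + (1 - alpha) * z2
print(blend, blend.sum())   # entries are nonnegative and sum to 1
```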

The modeler of the decision process must consult the client to represent preferences among elements of Z. Let z1 ≺ z2 mean that z2 is strictly preferred over z1. Let z1 ≈ z2 mean that z1 and z2 are equivalent in preference. Let z1 ≼ z2 mean that either z1 ≺ z2 or z1 ≈ z2. The following example illustrates the assignment of preferences.

Example 9.24 (Indicating Preferences) Suppose that U = Θ = {1, 2}, which leads to four possible outcomes: (1, 1), (1, 2), (2, 1), and (2, 2). Imagine that nature represents a machine that generates 1 or 2 according to a probability distribution. The action is to guess the number that will be generated by the machine. If you pick the same number, then you win that number of gold pieces. If you do not pick the same number, then you win nothing, but also lose nothing.

Consider the construction of the state space X by using f. The outcomes (2, 1) and (1, 2) are identical concerning any conceivable reward. Therefore, these should map to the same state. The other two outcomes are distinct. The state space therefore needs only three elements and can be defined as X = {0, 1, 2}. Let f(2, 1) = f(1, 2) = 0, f(1, 1) = 1, and f(2, 2) = 2. Thus, the last two states indicate that some gold will be earned.

The set Z of probability distributions over X is now considered. Each z ∈ Z is a three-dimensional vector. As an example, z1 = [1/2 1/4 1/4] indicates that the state will be 0 with probability 1/2, 1 with probability 1/4, and 2 with probability 1/4. Suppose z2 = [1/3 1/3 1/3]. Which distribution would you prefer? It seems in this case that z2 is uniformly better than z1 because there is a greater chance of winning gold. Thus, we declare z1 ≺ z2. The distribution z3 = [1 0 0] seems to be the worst imaginable. Hence, we can safely declare z3 ≺ z1 and z1 ≺ z2.

The procedure of determining the preferences can become quite tedious for complicated problems. In the current example, Z is a 2D subset of R3. This subset can be partitioned into a finite set of regions over which the client may be able to clearly indicate preferences. One of the major criticisms of this framework is the impracticality of determining preferences over Z [831].

After the preferences are determined, is there a way to ensure that a real-valued function on X exists for which the expected value exactly reflects the preferences? If the axioms of rationality are satisfied by the assignment of preferences, then the answer is yes. These axioms are covered next. ■

Axioms of rationality

To meet the goal of designing a utility function, it turns out that the preferences must follow rules called the axioms of rationality. They are sensible statements of consistency among the preferences. As long as these are followed, then a utility function is guaranteed to exist (detailed arguments appear in [268, 831]). The decision maker is considered rational if the following axioms are followed when defining ≺ and ≈:6

  1. If z1, z2 ∈ Z, then either z1 ≼ z2 or z2 ≼ z1. "You must be able to make up your mind."

  2. If z1 ≼ z2 and z2 ≼ z3, then z1 ≼ z3. "Preferences must be transitive."

  3. If z1 ≺ z2, then

αz1 + (1 − α)z3 ≺ αz2 + (1 − α)z3, (9.84)

for any z3 ∈ Z and α ∈ (0, 1).

“Evenly blending in a new distribution does not alter preference.”

  4. If z1 ≺ z2 ≺ z3, then there exists some α ∈ (0, 1) and β ∈ (0, 1) such that

αz1 + (1 − α)z3 ≺ z2 (9.85)

and

z2 ≺ βz1 + (1 − β)z3. (9.86)

“There is no heaven or hell.”

6Alternative axiom systems exist [268, 839].

Each axiom has an intuitive interpretation that makes practical sense. The first one simply indicates that the preference direction can always be inferred for a pair of distributions. The second axiom indicates that preferences must be transitive.7 The last two axioms are somewhat more complicated. In the third axiom, z2 is strictly preferred to z1. An attempt is made to cause confusion by blending in a third distribution, z3. If the same “amount” of z3 is blended into both z1 and z2, then the preference should not be affected. The final axiom involves z1, z2, and z3, each of which is strictly better than its predecessor. The first equation, (9.85), indicates that if z2 is strictly better than z1, then a tiny amount of z3 can be blended into z1, with z2 remaining preferable. If z3 had been like “heaven” (i.e., infinite reward), then this would not be possible. Similarly, (9.86) indicates that a tiny amount of z1 can be blended into z3, and the result remains better than z2. This means that z1 cannot be “hell,” which would have infinite negative reward.8

Constructing a utility function

If the preferences have been determined in a way consistent with the axioms, then it can be shown that a utility function always exists. This means that there exists a function U : X → R such that, for all z1, z2 ∈ Z,

z1 ≺ z2 if and only if Ez1[U] < Ez2[U],   (9.87)

in which Ezi denotes the expected value of U, which is being treated as a random variable under the probability distribution zi. The existence of U implies that it is safe to determine the best action by maximizing the expected utility.

A reward function can be defined using a utility function, U, as R(u, θ) = U(f(u, θ)). The utility function can be converted to a cost function as L(u, θ) = −R(u, θ) = −U(f(u, θ)). Minimizing the expected cost, as was recommended under Formulations 9.3 and 9.4 with probabilistic uncertainty, now seems justified under the assumption that U was constructed correctly to preserve preferences.
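To see (9.87) in action, consider a hypothetical utility function for Example 9.24. The particular values below are an assumption made only for illustration (the text does not prescribe them); maximizing expected utility, or equivalently minimizing expected cost, recovers the preference ordering that was declared over z1, z2, and z3.

```python
import numpy as np

# Sketch with assumed numbers: a utility function on the state space
# X = {0, 1, 2} of Example 9.24, e.g., U(x) = x (the number of gold pieces).
U = np.array([0.0, 1.0, 2.0])          # U(0), U(1), U(2)

z1 = np.array([0.5, 0.25, 0.25])
z2 = np.array([1.0 / 3, 1.0 / 3, 1.0 / 3])
z3 = np.array([1.0, 0.0, 0.0])

for name, z in [("z1", z1), ("z2", z2), ("z3", z3)]:
    expected_utility = float(z @ U)
    print(name, "expected utility:", expected_utility,
          " expected cost:", -expected_utility)
# The ordering of expected utilities (z3 < z1 < z2) reproduces the declared
# preferences z3 ≺ z1 ≺ z2, as (9.87) requires.
```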

Unfortunately, establishing the existence of a utility function does not produce a systematic way to construct it. In most circumstances, one is forced to design U by a trial-and-error process that involves repeatedly checking the preferences. In the vast majority of applications, people create utility and cost functions without regard to the implications discussed in this section. Thus, undesirable conclusions may be reached in practice. Therefore, it is important not to be too confident about the quality of an optimal decision rule.

Note that if worst-case analysis had been used, then most of the problems discussed here could have been avoided. Worst-case analysis, however, has its weaknesses, which will be discussed in Section 9.5.3.

7For some reasonable problems, however, transitivity is not desirable. See the Condorcet and Simpson paradoxes in [831].

8Some axiom systems allow infinite rewards, which lead to utility and cost functions with infinite values, but this is not considered here.

Figure 9.6: The utility of new amounts of money decreases as the total accumulation of wealth increases. The utility function may even be bounded. (The figure plots a concave utility curve U(x) against the amount of money x.)

Example 9.25 (The Utility of Money) We conclude the section by depicting a utility function that is often applied to money. Suppose that the state space X = R, which corresponds to the amount of U.S. dollars earned. The utility of money applied by most people indicates that the value of new increments of money decreases as the total accumulated wealth increases. The utility function may even be bounded. Imagine there is some maximum dollar amount, such as $10^100, after which additional money has no value. A typical utility curve is shown in Figure 9.6 [89]. ■
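One common way to capture both the decreasing marginal value and the possible bound is a saturating curve. The sketch below uses an assumed functional form and assumed constants; it is only meant to mimic the qualitative shape of Figure 9.6, not to reproduce any curve from the book.

```python
import math

# Sketch (assumed functional form): a bounded, concave utility of money in
# the spirit of Figure 9.6, U(x) = M * (1 - exp(-x / c)).
M, c = 1.0, 50_000.0                   # hypothetical saturation level and scale

def utility_of_money(x: float) -> float:
    return M * (1.0 - math.exp(-x / c))

# The marginal utility of an extra $10,000 shrinks as accumulated wealth grows,
# which is the effect discussed in Example 9.23.
for wealth in (0.0, 100_000.0, 10_000_000.0):
    gain = utility_of_money(wealth + 10_000.0) - utility_of_money(wealth)
    print(f"wealth ${wealth:>12,.0f}: utility gain from +$10,000 = {gain:.6f}")
```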

9.5.2 Concerns Regarding the Probabilistic Model

Section 9.5.1 addressed the source of cost functions and the validity of taking their expectations. This section raises concerns over the validity of the probability distributions used in Section 9.2. The two main topics are criticisms of Bayesian methods in general and problems with constructing probability distributions.

Bayesians vs. frequentists

For the past century and a half, there has been a fundamental debate among statisticians on the meaning of probabilities. Virtually everyone is satisfied with the axioms of probability, but beyond this, what is their meaning when making inferences? The two main camps are the frequentists and the Bayesians. A form of Bayes’ rule was published in 1763 after the death of Bayes [80]. During most of the nineteenth century Bayesian analysis tended to dominate literature; however, during the twentieth century, the frequentist philosophy became more popular as a more rigorous interpretation of probabilities. In recent years, the credibility of Bayesian methods has been on the rise again.

As seen so far, a Bayesian interprets probabilities as the degree of belief in a hypothesis. Under this philosophy, it is perfectly valid to begin with a prior distribution, gather a few observations, and then make decisions based on the resulting posterior distribution from applying Bayes' rule.

From a frequentist perspective, Bayesian analysis makes far too liberal use of probabilities. The frequentist believes that probabilities are only defined as the quantities obtained in the limit after the number of independent trials tends to infinity. For example, if an unbiased coin is tossed over numerous trials, the probability 1/2 represents the value to which the ratio between heads and the total number of trials will converge as the number of trials tends to infinity. On the other hand, a Bayesian might say that the probability that the next trial results in heads is 1/2. To a frequentist, this interpretation of probability is too strong.

Frequentists have developed a version of decision theory based on their philos- ophy; comparisons between the two appear in [831]. As an example, a frequentist would advocate optimizing the following frequentist risk to obtain a decision rule:

R(θ, π) = ∫_Y L(π(y), θ) P(y|θ) dy,   (9.88)

in which π represents the strategy, π : Y → U. The frequentist risk averages over all data, rather than making a decision based on a single observation, as advocated by Bayesians in (9.26). The probability P(y|θ) is assumed to be obtained in the limit as the number of independent data trials tends to infinity. The main drawback in using (9.88) is that the optimization depends on θ. The resulting best decision rule must depend on θ, which is unknown. In some limited cases, it may be possible to select some π that optimizes (9.88) for all θ, but this rarely occurs. Thus, the frequentist risk can be viewed as a constraint on the desirability of strategies, but it usually is not powerful enough to select a single one. This problem is reminiscent of Pareto optimality, which was discussed in Section 9.1.1. The frequentist approach attempts to be more conservative and rigorous, with the result being that weaker statements are made regarding decisions.
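A tiny discrete example (with made-up numbers, added here for illustration) shows why (9.88) rarely singles out one strategy: each candidate strategy π is scored by a whole vector of risks, one per θ, and different strategies win for different θ.

```python
import itertools

# Toy sketch of the frequentist risk (9.88) for a finite observation space.
Y = [0, 1]                                   # observations
U = [0, 1]                                   # actions
thetas = [0, 1]                              # nature's parameter

P = {0: [0.8, 0.2], 1: [0.3, 0.7]}           # assumed P(y | θ)

def L(u, theta):                             # assumed 0-1 loss
    return 0.0 if u == theta else 1.0

def risk(strategy, theta):                   # strategy maps y -> u, stored as a tuple
    return sum(L(strategy[y], theta) * P[theta][y] for y in Y)

for strategy in itertools.product(U, repeat=len(Y)):
    print(strategy, [round(risk(strategy, th), 2) for th in thetas])
# No single strategy minimizes the risk for every θ simultaneously.
```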

The source of prior distributions

Suppose that the Bayesian method has been adopted. The most widespread concern in all Bayesian analyses is the source of the prior distribution. In Section 9.2, this is represented as P(θ) (or p(θ)), which represents a distribution (or density) over the nature action space. The best way to obtain P(θ) is by estimating the distribution over numerous independent trials. This brings its definition into alignment with frequentist views. This was possible with Example 9.11, in which P(θ) could be reliably estimated from the frequency of occurrence of letters across numerous pages of text. The distribution could even be adapted to a particular language or theme.

In most applications that use decision theory, however, it is impossible or too costly to perform such experiments. What should be done in this case? If a prior distribution is simply "made up," then the resulting posterior probabilities may be suspect. In fact, it may be invalid to call them probabilities at all. Sometimes the term subjective probabilities is used in this case. Nevertheless, this is commonly done because there are few other options. One of these options is to resort to frequentist decision theory, but, as mentioned, it does not work with single observations.

Fortunately, as the number of observations increases, the influence of the prior on the Bayesian posterior distributions diminishes. If there is only one observation, or even none as in Formulation 9.3, then the prior becomes very influential. If there is little or no information regarding P(θ), the distribution should be designed as carefully as possible. It should also be understood that whatever conclusions are made with this assumption, they are biased by the prior. Suppose this model is used as the basis of a planning approach. You might feel satisfied computing the “optimal” plan, but this notion of optimality could still depend on some arbitrary initial bias due to the assignment of prior values.

If there is no information available, then it seems reasonable that P(θ) should be as uniform as possible over Θ. This was referred to by Laplace as the “principle of insufficient reason” [581]. If there is no reason to believe that one element is more likely than another, then they should be assigned equal values. This can also be justified by using Shannon’s entropy measure from information theory [49, 248, 864]. In the discrete case, this is

− Σ_{θ∈Θ} P(θ) lg P(θ),   (9.89)

and in the continuous case it is

− ∫_Θ p(θ) lg p(θ) dθ.   (9.90)

This entropy measure was developed in the context of communication systems to estimate the minimum number of bits needed to encode messages delivered through a noisy medium. It generally indicates the amount of uncertainty associated with the distribution. A larger value of entropy implies a greater amount of uncertainty. It turns out that the entropy function is maximized when P(θ) is a uniform distribution, which seems to justify the principle of insufficient reason. This can be considered as a noninformative prior. The idea is even applied quite frequently when Θ = R, which leads to an improper prior. The density function cannot maintain a constant, nonzero value over all of R because its integral would be infinite. Since the decisions made in Section 9.2 do not depend on any normalizing factors, a constant value can be assigned for p(θ) and the decisions are not affected by the fact that the prior is improper.
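A direct computation (a small sketch added here, not from the text) illustrates the claim: among distributions over a three-element Θ, the entropy (9.89) peaks at the uniform assignment.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a discrete distribution in bits, as in (9.89);
    terms with P(θ) = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())

# Over a three-element Θ, the uniform distribution has the largest entropy.
print(entropy_bits([1/3, 1/3, 1/3]))    # lg 3 ≈ 1.585 bits (maximum)
print(entropy_bits([0.5, 0.25, 0.25]))  # 1.5 bits
print(entropy_bits([1.0, 0.0, 0.0]))    # 0 bits: no uncertainty at all
```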

The main difficulty with applying the entropy argument in the selection of a prior is that Θ itself may be chosen in a number of arbitrary ways. Uniform assignments to different choices of Θ ultimately yield different information regarding the priors. Consider the following example.

Example 9.26 (A Problem with Noninformative Priors) Consider a decision about what activities to do based on the weather. Imagine that there is absolutely no information about what kind of weather is possible. One possible assignment is Θ = {p, c}, in which p means "precipitation" and c means "clear." Maximizing (9.89) suggests assigning P(p) = P(c) = 1/2.

After thinking more carefully, perhaps we would like to distinguish between different kinds of precipitation. A better set of nature actions would be Θ = {r, s, c}, in which c still means "clear," but precipitation p has been divided into r for "rain" and s for "snow." Now maximizing (9.89) assigns probability 1/3 to each nature action. This is clearly different from the original assignment. Now that we distinguish between different kinds of precipitation, it seems that precipitation is much more likely to occur. Does our preference to distinguish between different forms of precipitation really affect the weather? ■

Example 9.27 (Noninformative Priors for Continuous Spaces) Similar troubles can result in continuous spaces. Recall the parameter estimation problem described in Example 9.12. Suppose instead that the task is to estimate a line based on some data points that were supposed to fall on the line but missed due to noise in the measurement process.

What initial probability density should be assigned to Θ, the set of all lines?

Suppose that the line lives in Z = R2. The line equation can be expressed as

θ1z1 + θ2z2 + θ3 = 0. (9.91)

The problem is that if the parameter vector, θ = [θ1 θ2 θ3], is multiplied by a scalar constant, then the same line is obtained. Thus, even though θ ∈ R3, a constraint must be added. Suppose we require that

θ1² + θ2² + θ3² = 1   (9.92)

and θ1 ≥ 0. This mostly fixes the problem and ensures that each parameter value corresponds to a unique line (except for some duplicate cases at θ1 = 0, but these can be safely neglected here). Thus, the parameter space is the upper half of a sphere, S2. The maximum-entropy prior suggests assigning a uniform probability density to Θ. This may feel like the right thing to do, but this notion of uniformity is biased by the particular constraint applied to the parameter space to ensure uniqueness. There are many other choices. For example, we could replace (9.92) by constraints that force the points to lie on the upper half of the surface of a cube, instead of a sphere. A uniform probability density assigned in this new parameter space certainly differs from one over the sphere.

In some settings, there is a natural representation of the parameter space that is invariant to certain transformations. Section 5.1.4 introduced the notion of Haar measure. If the Haar measure is used as a noninformative prior, then a meaningful notion of uniformity may be obtained. For example, suppose that the parameter space is SO(3). Uniform probability mass over the space of unit quaternions, as suggested in Example 5.14, is an excellent choice for a noninformative prior because it is consistent with the Haar measure, which is invariant to group operations applied to the events. Unfortunately, a Haar measure does not exist for most spaces that arise in practice.9 ■

9A locally compact topological group is required [346, 836].
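For concreteness, such a noninformative prior over SO(3) can be realized by sampling unit quaternions uniformly. The sketch below uses the standard construction of the kind suggested in Example 5.14; it is included only as an illustration, not as a prescription from this section.

```python
import numpy as np

def random_unit_quaternion(rng):
    """Sketch of a standard construction for sampling a unit quaternion
    uniformly with respect to the Haar measure on SO(3)."""
    u1, u2, u3 = rng.random(3)
    return np.array([
        np.sqrt(1.0 - u1) * np.sin(2.0 * np.pi * u2),
        np.sqrt(1.0 - u1) * np.cos(2.0 * np.pi * u2),
        np.sqrt(u1) * np.sin(2.0 * np.pi * u3),
        np.sqrt(u1) * np.cos(2.0 * np.pi * u3),
    ])

rng = np.random.default_rng(0)
q = random_unit_quaternion(rng)
print(q, np.linalg.norm(q))   # unit norm up to rounding error
```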

Incorrect assumptions on conditional distributions

One final concern is that many times even the distribution P(y|θ) is incorrectly estimated because it is assumed arbitrarily to belong to a family of distributions. For example, it is often very easy to work with Gaussian densities. Therefore, it is tempting to assume that p(y|θ) is Gaussian. Experiments can be performed to estimate the mean and variance parameters. Even though some best fit will be found, it does not necessarily imply that a Gaussian is a good representation. Conclusions based on this model may be incorrect, especially if the true distribution has a different shape, such as having a larger tail or being multimodal. In many cases, nonparametric methods may be needed to avoid such biases. Such methods do not assume a particular family of distributions. For example, imagine estimating a probability distribution by making a histogram that records the frequency of y occurrences for a fixed value of θ. The histogram can then be normalized to contain a representation of the probability distribution without assuming an initial form.
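The histogram idea can be sketched in a few lines; the data below are synthetic stand-ins for measured y values at a fixed θ, chosen here only to show that no parametric family is assumed.

```python
import numpy as np

# Sketch: a nonparametric (histogram) estimate of p(y | θ) for one fixed θ,
# built from samples without assuming a Gaussian or any other parametric family.
def histogram_density(samples, num_bins=20):
    counts, edges = np.histogram(samples, bins=num_bins)
    widths = np.diff(edges)
    density = counts / (counts.sum() * widths)   # normalize to integrate to 1
    return density, edges

rng = np.random.default_rng(0)
samples = rng.exponential(scale=2.0, size=5000)  # stand-in for observed y values
density, edges = histogram_density(samples)
print(float((density * np.diff(edges)).sum()))   # ≈ 1.0
```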

9.5.3 Concerns Regarding the Nondeterministic Model

Given all of the problems with probabilistic modeling, it is tempting to abandon the whole framework and work strictly with the nondeterministic model. This only requires specifying Θ, without indicating anything about the relative likelihoods of various actions. Therefore, most of the complicated issues presented in Sections 9.5.1 and 9.5.2 vanish. Unfortunately, this advantage comes at a substantial price. Making decisions with worst-case analysis under the nondeterministic model has its own shortcomings. After considering the trade-offs, you can decide which is most appropriate for a particular application of interest.

The first difficulty is to ensure that Θ is sufficiently large to cover all possibilities. Consider Formulation 9.6, in which nature acts twice. Through a nature observation action space, Ψ(θ), interference is caused in the measurement process. Suppose that Θ = R and h(θ, ψ) = θ + ψ. In this case, Ψ(θ) can be interpreted as the measurement error. What is the maximum amount of error that can occur? Perhaps a sonar is measuring the distance from the robot to a wall. Based on the sensor specifications, it may be possible to construct a nice bound on the error. Occasionally, however, the error may be larger than this bound. Sonars sometimes fail to hear the required echo to compute the distance. In this case the reported distance is ∞. Due to reflections, extremely large errors can sometimes occur. Although such errors may be infrequent, if we want guaranteed performance, then large or even infinite errors should be included in Ψ(θ). The problem is that worst-case reasoning could always conclude that the sensor is useless by reporting ∞. Any statistically valid information that could be gained from the sensor would be ignored. Under the probabilistic model, it is easy to make Ψ(θ) quite large and then assign very small probabilities to larger errors. The problem with nondeterministic uncertainty is that Ψ(θ) needs to be smaller to make appropriate decisions; however, theoretically "guaranteed" performance may not truly be guaranteed in practice.

Once a nondeterministic model is formulated, the optimal decision rule may produce results that seem absurd for the intended application. The problem is that the DM cannot tolerate any risk. An action is applied only if the result can be guaranteed. The hope of doing better than the worst case is not taken into account. Consider the following example:

Example 9.28 (A Problem with Conservative Decision Making) Suppose that a friend offers you the choice of either a check for 1000 Euros or 1 Euro in cash. With the check, you must take it to the bank, and there is a small chance that your friend will have insufficient funds in the account. In this case, you will receive nothing. If you select the 1 Euro in cash, then you are guaranteed to earn something.

The following cost matrix reflects the outcomes (ignoring utility theory):

     U
Θ   −1  −1000
    −1      0        (9.93)

Using probabilistic analysis, we might conclude that it is best to take the check. Perhaps the friend is even known to be very wealthy and responsible with banking accounts. This information, however, cannot be taken into account in the decision-making process. Using worst-case analysis, the optimal action is to take the 1 Euro in cash. You may not feel too good about it, though. Imagine the regret if you later learn that the account had sufficient funds to cash the check for 1000 Euros. ■
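Under an assumed prior on nature (the 0.95 below is made up only for illustration and is not given in the text), the two analyses of Example 9.28 can be compared directly:

```python
import numpy as np

# Sketch: worst-case vs. expected-cost analysis of the matrix in (9.93).
# Columns are the actions (1 Euro in cash, the 1000 Euro check); rows are
# nature's choices (funds available, check bounces).
L = np.array([[-1.0, -1000.0],
              [-1.0,     0.0]])
P_theta = np.array([0.95, 0.05])      # assumed: the friend is probably reliable

worst_case = L.max(axis=0)            # worst cost of each action over nature
expected = P_theta @ L                # expected cost of each action

print("worst-case costs:", worst_case)   # [-1, 0]    -> cash is the minimax choice
print("expected costs:  ", expected)     # [-1, -950] -> the check minimizes expectation
```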

Thus, it is important to remember the price that one must pay for wanting results that are absolutely guaranteed. The probabilistic model offers the flexibility of incorporating statistical information. Sometimes the probabilistic model can be viewed as a generalization of the nondeterministic model. If it is assumed that nature acts after the robot, then the nature action can take this into account, as incorporated into Formulation 9.4. In the nondeterministic case, Θ(u) is specified, and in the probabilistic case, P(θ|u) is specified. The distribution P(θ|u) can be designed so that nature selects with very high probability the θ ∈ Θ that maximizes L(u, θ). In Example 9.28, this would mean that the probability that the check would bounce (resulting in no earnings) would be very high, such as 0.999999. In this case, even the optimal action under the probabilistic model is to select the 1 Euro in cash. For virtually any decision problem that is modeled using worst-case analysis, it is possible to work backward and derive possible priors for which the same decision would be made using probabilistic analysis. In Example 9.28, it seemed as if the decision was based on assuming that with very high probability, the check would bounce, even though there were no probabilistic models.

This means that worst-case analysis under the nondeterministic model can be considered as a special case of a probabilistic model in which the prior distribution assigns high probabilities to the worst-case outcomes. The justification for this could be criticized in the same way that other prior assignments are criticized in Bayesian analysis. What is the basis of this particular assignment?

9.5.4 Concerns Regarding Game Theory

One of the most basic limitations of game theory is that each player must know the cost functions of the other players. As established in Section 9.5.1, it is even quite difficult to determine an appropriate cost function for a single decision maker. It is even more difficult to determine costs and motivations of other players. In most practical settings this information is not available. One possibility is to model uncertainty associated with knowledge of the cost function of another player. Bayesian analysis could be used to reason about the cost based on observations of actions chosen by the player. Issues of assigning priors once again arise. One of the greatest difficulties in allowing uncertainties in the cost functions is that a kind of “infinite reflection” occurs [392]. For example, if I am playing a game, does the other player know my cost function? I may be uncertain about this. Furthermore, does the other player know that I do not completely know its cost function? This kind of second-guessing can occur indefinitely, leading to a nightmare of nested reasoning and assignments of prior distributions.10

The existence of saddle points or Nash equilibria was assured by using randomized strategies. Mathematically, this appears to be a clean solution to a frustrating problem; however, it also represents a substantial change to the model. Many games are played just once. For the expected-case results to converge, the game must be played an infinite number of times. If a game is played once, or only a few times, then the players are very likely to experience regret, even though the theory based on expected-case analysis indicates that regret is eliminated.

Another issue is that intelligent human players may fundamentally alter their strategies after playing a game several times. It is very difficult for humans to simulate a randomized strategy (assuming they even want to, which is unlikely). There are even international tournaments in which the players repeatedly engage

10Readers familiar with the movie The Princess Bride may remember the humorous dialog between Vizzini and the Dread Pirate Roberts about which goblet contains the deadly Iocane powder.

in classic games such as Rock-Paper-Scissors or the Prisoner’s Dilemma. For an interesting discussion of a tournament in which people designed programs that repeatedly compete on the Prisoner’s Dilemma, see [917]. It was observed that even some cooperation often occurs after many iterations, which secures greater rewards for both players, even though they cannot communicate. A famous strategy arose in this context called Tit-for-Tat (written by Anatol Rapoport), which in each stage repeated the action chosen by the other player in the last stage. The approach is simple yet surprisingly successful.

In the case of nonzero-sum games, it is particularly disheartening that multiple Nash equilibria may exist. Suppose there is only one admissible equilibrium among several Nash equilibria. Does it really seem plausible that an adversary would think very carefully about the various Nash equilibria and pick the admissible one? Perhaps some players are conservative and even play security strategies, which completely destroys the assumptions of minimizing regret. If there are multiple admissible Nash equilibria, it appears that regret is unavoidable unless there is some collaboration between players. This result is unfortunate if such collaboration is impossible.

Further Reading

Section 9.1 covered very basic concepts, which can be found in numerous books and on the Internet. For more on Pareto optimality, see [847, 909, 953, 1005]. Section 9.2 is inspired mainly by decision theory books. An excellent introduction is [89]. Other sources include [268, 271, 673, 831]. The “game against nature” view is based mainly on [109]. Pattern classification, which is an important form of decision theory, is covered in [19, 271, 295, 711]. Bayesian networks [778] are a popular representation in artificial intelligence research and often provide compact encodings of information for complicated decision-making problems.

Further reading on the game theory concepts of Sections 9.3 and 9.4 can be found in many books (e.g., [59, 759]). A fun book that has many examples and intuitions is [917]. For games that have infinite action sets, see [59]. The computation of randomized Nash equilibria remains a topic of active research. A survey of methods appears in [691]; see also [545, 699]. The coupled polynomial equations that appear in computing randomized Nash equilibria may seem to suggest applying algorithms from computational algebraic geometry, as were needed in Section 6.4 to solve this kind of problem in combinatorial motion planning. An approach that uses such tools is given in [261]. Contrary to the noncooperative games defined in Section 9.4, cooperative game theory investigates ways in which various players can form coalitions to improve their rewards [779].

Parts of Section 9.5 were inspired by [89]. Utility theory appears in most decision theory books (e.g., [89]) and in some artificial intelligence books (e.g., [839]). An in-depth discussion of Bayesian vs. frequentist issues appears in [831]. For a thorough introduction to constructing cost models for decision making, see [539].

Exercises

1. Suppose that a single-stage two-objective decision-making problem is defined in which there are two objectives and a continuous set of actions, U = [−10, 10]. The cost vector is L = [u² u − 1]. Determine the set of Pareto-optimal actions.
2. Let

        Θ
U   −1  3  2 −1
    −1  0  7 −1
     1  5  5 −2

define the cost for each combination of choices by the decision maker and nature. Let nature's randomized strategy be [1/5 2/5 1/10 3/10].

   (a) Use nondeterministic reasoning to find the minimax decision and worst-case cost.
   (b) Use probabilistic reasoning to find the best expected-case decision and expected cost.
3. Many reasonable decision rules are possible, other than those considered in this chapter.
   (a) Exercise 2(a) reflects extreme pessimism. Suppose instead that extreme optimism is used. Select the choice that optimizes the best-case cost for the matrix in Exercise 2.
   (b) One approach is to develop a coefficient of optimism, α ∈ [0, 1], which allows one to interpolate between the two extreme scenarios. Thus, a decision, u ∈ U, is chosen by minimizing

α max_{θ∈Θ} L(u, θ) + (1 − α) min_{θ∈Θ} L(u, θ).   (9.94)

Determine the optimal decision for this scenario under all possible choices for α ∈ [0, 1]. Give your answer as a list of choices, each with a specified range of α.

4. Suppose that after making a decision, you observe the choice made by nature. How does the cost that you received compare with the best cost that could have been obtained if you chose something else, given this choice by nature? This difference in costs can be considered as regret or minimum "Doh!"11 Psychologists have argued that some people make choices based on minimizing regret. It reflects how badly you wish you had done something else after making the decision.

   (a) Develop an expression for the worst-case regret, and use it to make a minimax regret decision using the matrix from Exercise 2.

11In 2001, the Homer Simpson term "Doh!" was added to the Oxford English Dictionary as an expression of regret.

   (b) Develop an expression for the expected regret, and use it to make a minimum expected regret decision.
5. Using the matrix from Exercise 2, consider the set of all probability distributions for nature. Characterize the set of all distributions for which the minimax decision and the best expected decision result in the same choice. This indicates how to provide reverse justification for priors.
6. Consider a Bayesian decision-theory scenario with cost function L. Show that the decision rule never changes if L(u, θ) is replaced by aL(u, θ) + b, for any a > 0 and b ∈ R.

7. Suppose that there are two classes, Ω = {ω1, ω2}, with P(ω1) = P(ω2) = 1/2. The observation space, Y, is R. Recall from probability theory that the normal (or Gaussian) probability density function is

p(y) = (1 / (σ√(2π))) exp(−(y − µ)² / (2σ²)),   (9.95)

in which µ denotes the mean and σ² denotes the variance. Suppose that p(y|ω1) is a normal density in which µ = 0 and σ² = 1. Suppose that p(y|ω2) is a normal density in which µ = 6 and σ² = 4. Find the optimal classification rule, γ : Y → Ω.

You are welcome to solve the problem numerically (by computer) or graphically (by careful function plotting). Carefully explain how you arrived at the answer in any case.

8. Let

        Θ
U    2 −2 −2  1
    −1 −2 −2  6
     4  0 −3  4

give the cost for each combination of choices by the decision maker and nature. Let nature's randomized strategy be [1/4 1/2 1/8 1/8].

   (a) Use nondeterministic reasoning to find the minimax decision and worst-case cost.
   (b) Use probabilistic reasoning to find the best expected-case decision and expected cost.
   (c) Characterize the set of all probability distributions for which the minimax decision and the best expected decision result in the same choice.
9. In a constant-sum game, the costs for any u ∈ U and v ∈ V add to yield

L1(u, v) + L2(u, v) = c (9.96)

for some constant c that is independent of u and v. Show that any constant-sum game can be transformed into a zero-sum game, and that saddle point solutions can be found using techniques for the zero-sum formulation.

10. Formalize Example 9.7 as a zero-sum game, and compute security strategies for the players. What is the expected value of the game?
11. Suppose that for two zero-sum games, there exists some nonzero c ∈ R for which the cost matrix of one game is obtained by multiplying all entries by c in the cost matrix of the other. Prove that these two games must have the same deterministic and randomized saddle points.
12. In the same spirit as Exercise 11, prove that two zero-sum games have the same deterministic and randomized saddle points if c is added to all matrix entries.
13. Prove that multiple Nash equilibria of a nonzero-sum game specified by matrices A and B are interchangeable if (A, B) as a game yields the same Nash equilibria as the game (A, A).

14. Analyze the game of Rock-Paper-Scissors for three players. For each player, assign a cost of 1 for losing, 0 for a tie, and −1 for winning. Specify the cost functions. Is it possible to avoid regret? Does it have a deterministic Nash equilibrium? Can you find a randomized Nash equilibrium?

15. Compute the randomized equilibrium point for the following zero-sum game:

        V
U    0  −1
    −1   2        (9.97)

Indicate the randomized strategies for the players and the resulting expected value of the game.

Implementations

16. Consider estimating the value of an unknown parameter, θ ∈ R. The prior probability density is a normal,

p(θ) = (1 / (σ√(2π))) exp(−(θ − µ)² / (2σ²)),   (9.98)

with µ = 0 and σ = 4. Suppose that a sequence, y1, y2, . . ., yk, of k observations is made and that each p(yi|θ) is a normal density with µ = θ and σ = 9. Suppose that u represents your guess of the parameter value. The task is to select u to minimize the expectation of the cost, L(u, θ) = (u − θ)². Suppose that the "true" value of θ is 4. Determine u∗, the action that minimizes the expected cost, after observing yi = 4 for every i ∈ {1, . . . , k}.

   (a) Determine u∗ for k = 1.
   (b) Determine u∗ for k = 10.
   (c) Determine u∗ for k = 1000.

This experiment is not very realistic because the observations should be generated by sampling from the normal density, p(yi|θ). Repeat the exercise using values drawn from the normal density, instead of yk = 4, for each k.

17. Implement an algorithm that computes a randomized saddle point for zero-sum games. Assume that one player has no more than two actions and the other may have any finite number of actions.
18. Suppose that a K-stage decision-making problem is defined using multiple objectives. There is a finite state space X and a finite action set U(x) for each x ∈ X. A state transition equation, xk+1 = f(xk, uk), gives the next state from a current state and input. There are N cost functionals of the form

Li(u1, . . . , uK) = Σ_{k=1}^{K} l(xk, uk) + lF(xF),   (9.99)

in which F = K + 1. Assume that lF(xF) = ∞ if xF ∉ Xgoal (for some goal region Xgoal ⊂ X) and lF(xF) = 0 otherwise. Assume that there is no termination action (which simplifies the problem). Develop a value-iteration approach that finds the complete set of Pareto-optimal plans as efficiently as possible. If two or more plans produce the same cost vector, then only one representative needs to be returned.

results matching ""

    No results matching ""