Preliminary Concepts
- Optimization
- Optimizing a single objective
Before progressing to complicated decision-making models, first consider the simple case of a single decision maker that must make the best decision. This leads to a familiar optimization problem, which is formulated as follows.
Formulation 9.1 (Optimization)
- A nonempty set U called the action space. Each u ∈ U is referred to as an action.
- A function L : U → R ∪ {∞} called the cost function.
Compare Formulation 9.1 to Formulation 2.2. State space, X, and state transition concepts are no longer needed because there is only one decision. Since there is no state space, there is also no notion of initial and goal states. A strategy simply consists of selecting the best action.
What does it mean to be the “best” action? If U is finite, then the best action, u∗ ∈ U, is

u∗ = argmin_{u∈U} {L(u)}.   (9.1)
If U is infinite, then there are different cases. Suppose that U = (−1, 1) and L(u) = u. Which action produces the lowest cost? We would like to declare that −1 is the lowest cost, but −1 ∉ U. If we had instead defined U = [−1, 1], then this would work. However, if U = (−1, 1) and L(u) = u, then there is no action that produces minimum cost. For any action u ∈ U, a second one, u′ ∈ U, can always be chosen for which L(u′) < L(u). However, if U = (−1, 1) and L(u) = |u|, then (9.1) correctly reports that u = 0 is the best action. There is no problem in this case because the minimum occurs in the interior, as opposed to on the boundary of U. In general it is important to be aware that an optimal value may not exist.
There are two ways to fix this frustrating behavior. One is to require that U is a closed set and is bounded (both were defined in Section 4.1). Since closed sets include their boundary, this problem will be avoided. The bounded condition prevents a problem such as optimizing over U = R with L(u) = u. What is the best u ∈ U? Smaller and smaller values can be chosen for u to produce a lower cost, even though R is a closed set.
The alternative way to fix this problem is to define and use the notion of an infimum, denoted by inf. This is defined as the largest lower bound that can be placed on the cost. In the case of U = (−1, 1) and L(u) = u, this is

inf_{u∈(−1,1)} L(u) = −1.   (9.2)

The only difficulty is that there is no action u ∈ U that produces this cost. The infimum essentially uses the closure of U to evaluate (9.2). If U happened to be closed already, then the minimizing action would be included in U. Unbounded problems can also be handled. The infimum for the case of U = R and L(u) = u is −∞.
As a general rule, if you are not sure which to use, it is safer to write inf in the place where you would use min. The infimum happens to yield the minimum whenever a minimum exists. In addition, it gives a reasonable answer when no minimum exists. It may look embarrassing, however, to use inf in cases where it is obviously not needed (i.e., in the case of a finite U).
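As a concrete illustration, the following minimal Python sketch computes the argmin in (9.1) for a small finite action set, and then samples actions from the open interval U = (−1, 1) with L(u) = u to show the cost approaching the infimum −1 without ever attaining it. The helper name argmin_action and the sample values are chosen only for illustration.

```python
# Illustrative sketch: argmin over a finite action space, and a numerical look
# at the infimum for an open interval on which no minimum exists.

def argmin_action(U, L):
    """Return an action u in the finite set U that minimizes the cost L(u)."""
    return min(U, key=L)

# Finite case: the minimum is attained and (9.1) applies directly.
U_finite = [-0.9, -0.3, 0.0, 0.4, 0.8]
print(argmin_action(U_finite, lambda u: abs(u)))   # prints 0.0

# Open interval U = (-1, 1) with L(u) = u: no minimizing action exists, but the
# infimum is -1.  Actions ever closer to the excluded boundary have costs
# approaching -1 without ever reaching it.
for n in (10, 1000, 100000):
    u = -1.0 + 1.0 / n
    print(u)   # the cost L(u) = u tends to the infimum -1
```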
It is always possible to make an “upside-down” version of an optimization problem by multiplying L by −1. There is no fundamental change in the result, but sometimes it is more natural to formulate a problem as one of maximization instead of minimization. This will be done, for example, in the discussion of utility theory in Section 9.5.1. In such cases, a reward function, R, is defined instead of a cost function. The task is to select an action u ∈ U that maximizes the reward. It will be understood that a maximization problem can easily be converted into a minimization problem by setting L(u) = −R(u) for all u ∈ U. For maximization
problems, the infimum can be replaced by the supremum, sup, which is the least upper bound on R(u) over all u ∈ U.
For most problems in this book, the selection of an optimal u ∈ U in a single decision stage is straightforward; planning problems are instead complicated by many other aspects. It is important to realize, however, that optimization itself is extremely challenging if U and L are complicated. For example, U may be finite but extremely large, or U may be a high-dimensional (e.g., 1000) subset of R^n. Also, the cost function may be extremely difficult or even impossible to express in a simple closed form. If the function is simple enough, then standard calculus tools based on first and second derivatives may apply. In most real-world applications, however, more sophisticated techniques are needed. Many involve a form of gradient descent and therefore only ensure that a local minimum is found. In many cases, sampling-based techniques are needed. In fact, many of the
sampling ideas of Section 5.2, such as dispersion, were developed in the context of optimization. For some classes of problems, combinatorial solutions may exist. For example, linear programming involves finding the min or max of a linear function subject to linear constraints, and many combinatorial approaches exist [259, 264, 664, 731]. This optimization problem will appear in Section 9.4.
Given the importance of sampling-based and combinatorial methods in optimization, there are interesting parallels to motion planning. Chapters 5 and 6 each followed these two philosophies, respectively. Optimal motion planning actually corresponds to an optimization problem on the space of paths, which is extremely difficult to characterize. In some special cases, as in Section 6.2.4, it is possible to find optimal solutions, but in general, such problems are extremely challenging. Calculus of variations is a general approach for addressing optimization problems over a space of paths that must satisfy differential constraints [841]; this will be covered in Section 13.4.1.
- Multiobjective optimization
Suppose that there is a collection of cost functions, each of which evaluates an action. This leads to a generalization of Formulation 9.1 to multiobjective optimization.
Formulation 9.2 (Multiobjective Optimization)
- A nonempty set U called the action space. Each u ∈ U is referred to as an action.
- A vector-valued cost function of the form L : U → R^d for some integer d. If desired, ∞ may also be allowed for any of the cost components.
A version of this problem was considered in Section 7.7.2, which involved the optimal coordination of multiple robots. Two actions, u and u′, are called equivalent if L(u) = L(u′). An action u is said to dominate an action u′ if they are not equivalent and Li(u) ≤ Li(u′) for all i such that 1 ≤ i ≤ d. This defines a partial ordering, ≤, on the set of actions. Note that many actions may be incomparable. An action is called Pareto optimal if it is not dominated by any others. This means that it is minimal with respect to the partial ordering.
Example 9.1 (Simple Example of Pareto Optimality) Suppose that U = {1, 2, 3, 4, 5} and d = 2. The costs are assigned as L(1) = (4, 0), L(2) = (3, 3), L(3) = (2, 2), L(4) = (5, 7), and L(5) = (9, 0). The actions 2, 4, and 5 can be eliminated because they are dominated by other actions. For example, (3, 3) is dominated by (2, 2); hence, action u = 3 is preferable to u = 2. The remaining two actions, u = 1 and u = 3, are Pareto optimal.
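The dominance test above translates directly into a few lines of code. The following Python sketch filters out dominated actions for the costs of Example 9.1 and recovers the Pareto-optimal set {1, 3}; the helper names dominates and pareto_optimal are illustrative.

```python
# Illustrative sketch: discard dominated actions to obtain the Pareto-optimal
# set for the costs of Example 9.1.

def dominates(a, b):
    """True if cost vector a dominates b: a != b and a_i <= b_i for every i."""
    return a != b and all(x <= y for x, y in zip(a, b))

def pareto_optimal(L):
    """Return the actions whose cost vectors are not dominated by any other."""
    return [u for u, cu in L.items()
            if not any(dominates(cv, cu) for cv in L.values())]

L = {1: (4, 0), 2: (3, 3), 3: (2, 2), 4: (5, 7), 5: (9, 0)}
print(pareto_optimal(L))   # prints [1, 3]
```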
Based on this simple example, the notion of Pareto optimality seems mostly aimed at discarding dominated actions. Although there may be multiple Pareto-optimal solutions, it at least narrows down U to a collection of the best alternatives.
Example 9.2 (Pennsylvania Turnpike) Imagine driving across the state of Pennsylvania and being confronted with the Pennsylvania Turnpike, which is a toll highway that once posted threatening signs about speed limits and the corresponding fines for speeding. Let U = {50, 51, . . . , 100} represent possible integer speeds, expressed in miles per hour (mph). A posted sign indicates that the speeding fines are 1) $50 for being caught driving between 56 and 65 mph, 2) $100 for being caught between 66 and 75, 3) $200 between 76 and 85, and 4) $500 between 86 and 100. Beyond 100 mph, it is assumed that the penalty includes jail time, which is so severe that it will not be considered.
The two criteria for a driver are 1) the time to cross the state, and 2) the amount of money spent on tickets. It is assumed that you will be caught violating the speed limit. The goal is to minimize both. What are the resulting Pareto- optimal driving speeds? Compare driving 56 mph to driving 57 mph. Both cost the same amount of money, but driving 57 mph takes less time. Therefore, 57 mph dominates 56 mph. In fact, 65 mph dominates all speeds down to 56 mph because the cost is the same, and it reduces the time the most. Based on this argument, the Pareto-optimal driving speeds are 55, 65, 75, 85, and 100. It is up to the individual drivers to decide on the particular best action for them; however,
it is clear that no speeds outside of the Pareto-optimal set are sensible.
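The same dominance filter applies numerically to Example 9.2. In the sketch below, the crossing distance of 360 miles is a hypothetical value used only to produce travel times; because travel time strictly decreases with speed, any positive distance yields the same Pareto-optimal set.

```python
# Illustrative sketch of Example 9.2.  DISTANCE is a hypothetical crossing
# distance; it only sets the scale of the travel times.

def fine(speed):
    """Speeding fine in dollars, following the posted schedule of Example 9.2."""
    if speed <= 55:
        return 0
    if speed <= 65:
        return 50
    if speed <= 75:
        return 100
    if speed <= 85:
        return 200
    return 500

def dominates(a, b):
    return a != b and all(x <= y for x, y in zip(a, b))

DISTANCE = 360.0   # hypothetical width of the state in miles
costs = {s: (DISTANCE / s, fine(s)) for s in range(50, 101)}
pareto = [s for s, c in costs.items()
          if not any(dominates(d, c) for d in costs.values())]
print(pareto)   # prints [55, 65, 75, 85, 100]
```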
The following example illustrates the main frustration with Pareto optimality. Removing dominated actions may not narrow down the set of alternatives enough. In some cases, there may even be a continuum of Pareto-optimal solutions. Therefore, the Pareto-optimal concept is not always useful. Its value depends on the particular application.
Example 9.3 (A Continuum of Pareto-Optimal Solutions) Let U = [0, 1] and d = 2. Let L(u) = (u, 1 − u). In this case, every element of U is Pareto optimal. This can be seen by noting that a slight reduction in one criterion causes an increase in the other. Thus, any two actions are incomparable.
- Probability Theory Review
This section reviews some basic probability concepts and introduces notation that will be used throughout Part III.
Probability space A probability space is a three-tuple, (S, F, P), in which the three components are
- Sample space: A nonempty set S called the sample space, which represents all possible outcomes.
- Event space: A collection F of subsets of S, called the event space. If S is discrete, then usually F = pow(S). If S is continuous, then F is usually a sigma-algebra on S, as defined in Section 5.1.3.
- Probability function: A function, P : F → R, that assigns probabilities to the events in F. This will sometimes be referred to as a probability distribution over S.
The probability function, P, must satisfy several basic axioms:
1. P(E) ≥ 0 for all E ∈ F.
2. P(S) = 1.
3. P(E ∪ F) = P(E) + P(F) if E ∩ F = ∅, for all E, F ∈ F.
If S is discrete, then the definition of P over all of F can be inferred from its definition on single elements of S by using the axioms. It is common in this case to write P(s) for some s ∈ S, which is slightly abusive because s is not an event. It technically should be P({s}) for some {s} ∈ F.
Example 9.4 (Tossing a Die) Consider tossing a six-sided cube or die that has numbers 1 to 6 painted on its sides. When the die comes to rest, it will always show one number. In this case, S = {1, 2, 3, 4, 5, 6} is the sample space. The event space is pow(S), which is all 2^6 subsets of S. Suppose that the probability function is assigned to indicate that all numbers are equally likely. For any individual s ∈ S, P({s}) = 1/6. The events include all subsets so that any probability statement can be formulated. For example, what is the probability that an even number is obtained? The event E = {2, 4, 6} has probability P(E) = 1/2 of occurring.
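A minimal sketch of this probability space, in which P is assigned on the singletons and extended to arbitrary events by additivity (the third axiom), looks as follows; exact fractions are used so the probabilities print as rational numbers.

```python
# Illustrative sketch of the die's probability space from Example 9.4.
from fractions import Fraction

S = [1, 2, 3, 4, 5, 6]
P_single = {s: Fraction(1, 6) for s in S}   # P({s}) for each outcome

def P(event):
    """Probability of an event (a subset of S), obtained by additivity."""
    return sum(P_single[s] for s in event)

print(P({2, 4, 6}))   # prints 1/2: the probability of an even number
print(P(set(S)))      # prints 1: the whole sample space (second axiom)
```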
The third probability axiom looks similar to the last axiom in the definition of a measure space in Section 5.1.3. In fact, P is technically a special kind of measure, as mentioned in Example 5.12. If S is continuous, however, this measure cannot be captured by defining probabilities over the singleton sets. The probabilities of singleton sets are usually zero. Instead, a probability density function, p : S → R, is used to define the probability measure. The probability function, P, for any event E ∈ F can then be determined via integration:

P(E) = ∫_E p(x) dx,   (9.3)

in which x ∈ E is the variable of integration. Intuitively, P indicates the total probability mass that accumulates over E.
Conditional probability A conditional probability is expressed as P(E|F) for any two events E, F ∈ F and is called the “probability of E, given F.” Its definition is

P(E|F) = P(E ∩ F) / P(F).   (9.4)
Two events, E and F, are called independent if and only if P(E ∩ F) = P(E)P(F); otherwise, they are called dependent. An important and sometimes misleading concept is conditional independence. Consider some third event, G ∈ F. It might be the case that E and F are dependent, but when G is given, they become independent. Thus, P(E ∩ F) ≠ P(E)P(F); however, P(E ∩ F|G) = P(E|G)P(F|G). Such examples occur frequently in practice. For example, E might indicate someone’s height, and F is their reading level. These will generally be dependent events because children are generally shorter and have a lower reading level. If we are given the person’s age as an event G, then height is no longer important. It seems intuitive that there should be no correlation between height and reading level once the age is given.
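To see conditional independence numerically, the following sketch builds a small joint distribution over three binary events in which E and F are dependent, yet P(E ∩ F|G) = P(E|G)P(F|G), so they become independent once G is given. All numerical values are invented purely for illustration.

```python
# Illustrative sketch: a made-up joint distribution over binary events E, F, G
# in which E and F are dependent but conditionally independent given G.
from itertools import product

def joint(e, f, g):
    """Made-up joint probability of the binary outcomes (e, f, g)."""
    p_g = 0.5                    # P(G) = 1/2, so P(not G) = 1/2 as well
    p_e = 0.9 if g else 0.1      # P(E | G) and P(E | not G)
    p_f = 0.9 if g else 0.1      # P(F | G) and P(F | not G)
    return (p_g if g else 1 - p_g) * (p_e if e else 1 - p_e) * (p_f if f else 1 - p_f)

def prob(pred):
    """Probability of the event defined by the predicate pred(e, f, g)."""
    return sum(joint(e, f, g)
               for e, f, g in product((0, 1), repeat=3) if pred(e, f, g))

P_E = prob(lambda e, f, g: e)
P_F = prob(lambda e, f, g: f)
P_G = prob(lambda e, f, g: g)
P_EF = prob(lambda e, f, g: e and f)

# Unconditionally dependent: about 0.41 versus 0.25.
print(P_EF, P_E * P_F)

# Conditionally independent given G: both sides are about 0.81.
print(prob(lambda e, f, g: e and f and g) / P_G,
      (prob(lambda e, f, g: e and g) / P_G) * (prob(lambda e, f, g: f and g) / P_G))
```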
The definition of conditional probability, (9.4), imposes the constraint that
P(E ∩ F) = P(F)P(E|F) = P(E)P(F|E), (9.5)
which nicely relates P(E|F) to P(F|E). This results in Bayes’ rule, which is a convenient way to swap E and F:

P(F|E) = P(E|F)P(F) / P(E).   (9.6)
The probability distribution, P(F), is referred to as the prior, and P(F|E) is the posterior. These terms indicate that the probabilities come before and after E is considered, respectively.
If all probabilities are conditioned on some event, G ∈ F, then conditional Bayes’ rule arises, which only differs from (9.6) by placing the condition G on all probabilities:

P(F|E, G) = P(E|F, G)P(F|G) / P(E|G).   (9.7)
Marginalization Let the events F1, F2, . . . , Fn be any partition of S. The probability of an event E can be obtained through marginalization as

P(E) = Σ_{i=1}^{n} P(E|Fi)P(Fi).   (9.8)
One of the most useful applications of marginalization is in the denominator of Bayes’ rule. A substitution of (9.8) into the denominator of (9.6) yields

P(F|E) = P(E|F)P(F) / Σ_{i=1}^{n} P(E|Fi)P(Fi).   (9.9)
This form is sometimes easier to work with because P(E) appears to be eliminated.
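A small numerical sketch of (9.8) and (9.9): the priors P(Fi) and likelihoods P(E|Fi) below are made-up values for a three-element partition, and the posterior is computed with the marginalized denominator.

```python
# Illustrative sketch of (9.8) and (9.9) with made-up numbers for a
# three-element partition F1, F2, F3 of the sample space.

prior = {"F1": 0.5, "F2": 0.3, "F3": 0.2}        # P(Fi); must sum to 1
likelihood = {"F1": 0.9, "F2": 0.4, "F3": 0.1}   # P(E | Fi)

# Marginalization (9.8) gives the denominator of Bayes' rule.
P_E = sum(likelihood[F] * prior[F] for F in prior)

# Posterior P(Fi | E) for every element of the partition, via (9.9).
posterior = {F: likelihood[F] * prior[F] / P_E for F in prior}

print(P_E)        # prints about 0.59
print(posterior)  # about {'F1': 0.763, 'F2': 0.203, 'F3': 0.034}
```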
Random variables Assume that a probability space (S, F, P) is given. A random variable¹ X is a function that maps S into R. Thus, X assigns a real value to every element of the sample space. This enables statistics to be conveniently computed over a probability space. If S is already a subset of R, X may by default represent the identity function.
Expectation The expectation or expected value of a random variable X is denoted by E[X]. It can be considered as a kind of weighted average for X, in which the weights are obtained from the probability distribution. If S is discrete, then

E[X] = Σ_{s∈S} X(s)P(s).   (9.10)
If S is continuous, then²

E[X] = ∫_S X(s)p(s) ds.   (9.11)
One can then define conditional expectation, which applies a given condition to the probability distribution. For example, if S is discrete and an event F is given, then

E[X|F] = Σ_{s∈S} X(s)P(s|F).   (9.12)
Example 9.5 (Tossing Dice) Returning to Example 9.4, the elements of S are already real numbers. Hence, a random variable X can be defined by simply letting X(s) = s. Using (9.10), the expected value, E[X], is 3.5. Note that the expected value is not necessarily a value that is “expected” in practice. It is impossible to actually obtain 3.5, because it is not contained in S. Suppose that the expected value of X is desired only over trials that result in numbers greater than 3. This can be described by the event F = {4, 5, 6}. Using conditional expectation, (9.12), the expected value is E[X|F] = 5.
Now consider tossing two dice in succession. Each element s ∈ S is expressed as s = (i, j) in which i, j ∈ {1, 2, 3, 4, 5, 6}. Since S ⊄ R, the random variable needs to be slightly more interesting. One common approach is to count the sum of the dice, which yields X(s) = i + j for any s ∈ S. In this case, E[X] = 7.
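The numbers in Example 9.5 can be checked directly from definitions (9.10) and (9.12); the short sketch below does so with exact rational arithmetic.

```python
# Illustrative check of Example 9.5 using definitions (9.10) and (9.12);
# exact fractions avoid any floating-point rounding.
from fractions import Fraction
from itertools import product

S = [1, 2, 3, 4, 5, 6]
P = {s: Fraction(1, 6) for s in S}   # uniform probability function

def X(s):
    return s                         # the identity random variable

# Expectation (9.10): prints 7/2, i.e., 3.5.
print(sum(X(s) * P[s] for s in S))

# Conditional expectation (9.12) with F = {4, 5, 6}: prints 5.
F = {4, 5, 6}
P_F = sum(P[s] for s in F)
print(sum(X(s) * (P[s] / P_F) for s in F))

# Two dice: X(s) = i + j for s = (i, j); the expected sum prints 7.
print(sum((i + j) * Fraction(1, 36) for i, j in product(S, repeat=2)))
```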
¹This is a terrible name, which often causes confusion. A random variable is not “random,” nor is it a “variable.” It is simply a function, X : S → R. To make matters worse, a capital letter is usually used to denote it, whereas lowercase letters are usually used to denote functions.
²Using the language of measure theory, both definitions are just special cases of the Lebesgue integral. Measure theory nicely unifies discrete and continuous probability theory, thereby avoiding the specification of separate cases. See [346, 546, 836].
- Randomized Strategies
Up until now, any actions taken in a plan have been deterministic. The plans in Chapter 2 specified actions with complete certainty. Formulation 9.1 was solved by specifying the best action. It can be viewed as a strategy that trivially makes the same decision every time.
In some applications, the decision maker may not want to be predictable. To achieve this, randomization can be incorporated into the strategy. If U is discrete, a randomized strategy, w, is specified by a probability distribution, P(u), over U. Let W denote the set of all possible randomized strategies. When the strategy is applied, an action u ∈ U is chosen by sampling according to the probability distribution, P(u). We now have to make a clear distinction between defining the strategy and applying the strategy. So far, the two have been equivalent; however, a randomized strategy must be executed to determine the resulting action. If the strategy is executed repeatedly, it is assumed that each trial is independent of the actions obtained in previous trials. In other words, P(uk|ui) = P(uk), in which P(uk|ui) represents the probability that the strategy chooses action uk in trial k, given that ui was chosen in trial i for some i < k. If U is continuous, then a randomized strategy may be specified by a probability density function, p(u). In decision-theory and game-theory literature, deterministic and randomized strategies are often referred to as pure and mixed, respectively.
Example 9.6 (Basing Decisions on a Coin Toss) Let U = {a, b}. A randomized strategy w can be defined as
- Flip a fair coin, which has two possible outcomes: heads (H) or tails (T).
- If the outcome is H, choose a; otherwise, choose b.
Since the coin is fair, w is defined by assigning P(a) = P(b) = 1/2. Each time the strategy is applied, it is not known what action will be chosen. Over many trials, however, it converges to choosing a half of the time.
A deterministic strategy can always be viewed as a special case of a randomized strategy, if you are not bothered by events that have probability zero. A deterministic strategy, ui ∈ U, can be simulated by a random strategy by assigning P(u) = 1 if u = ui, and P(u) = 0 otherwise. Only with probability zero can different actions be chosen (possible, but not probable!).
Imagine using a randomized strategy to solve a problem expressed using Formulation 9.1. The first difficulty appears to be that the cost cannot be predicted. If the strategy is applied numerous times, then we can define the average cost. As the number of times tends to infinity, this average would converge to the expected cost, denoted by L̄(w), if L is treated as a random variable (in addition to the cost function). If U is discrete, the expected cost of a randomized strategy w is
L̄(w) = Σ_{u∈U} L(u)P(u) = Σ_{i=1}^{|U|} L(ui)wi,   (9.13)

in which wi is the component of w corresponding to the particular action ui ∈ U.
An interesting question is whether there exists some w ∈ W such that L̄(w) < L(u), for all u ∈ U. In other words, do there exist randomized strategies that are better than all deterministic strategies, using Formulation 9.1? The answer is no because the best strategy is always to assign probability one to the action, u∗, that minimizes L. This is equivalent to using a deterministic strategy. If there are two or more actions that obtain the optimal cost, then a randomized strategy could arbitrarily distribute all of the probability mass between these. However, there would be no further reduction in cost. Therefore, randomization seems pointless in this context, unless there are other considerations.
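The following sketch evaluates the expected cost (9.13) of a randomized strategy over a small finite U and compares it with the best deterministic action; the cost values and the strategy w are made up for illustration, and, as argued above, the randomized strategy can never do better than the best deterministic one.

```python
# Illustrative sketch: the expected cost (9.13) of a made-up randomized
# strategy w over a small finite U, compared with the best deterministic action.
import random

U = ["a", "b", "c"]
L = {"a": 3.0, "b": 1.0, "c": 2.0}        # made-up costs
w = {"a": 0.2, "b": 0.5, "c": 0.3}        # a randomized strategy: P(u)

expected_cost = sum(L[u] * w[u] for u in U)     # equation (9.13)
best_deterministic = min(L[u] for u in U)

print(expected_cost)       # about 1.7
print(best_deterministic)  # 1.0 -- no randomized strategy can beat this

# Applying (executing) the strategy: sample one action per trial.
print(random.choices(U, weights=[w[u] for u in U])[0])
```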
One example in which a randomized strategy is of critical importance is making decisions in competition with an intelligent adversary. If the problem is repeated many times, an opponent could easily learn any deterministic strategy. Randomization can be used to weaken the prediction capabilities of an opponent. This idea will be used in Section 9.3 to obtain better ways to play zero-sum games.
Following is an example that illustrates the advantage of randomization when repeatedly playing against an intelligent opponent.
Example 9.7 (Matching Pennies) Consider a game in which two players repeatedly play a simple game of placing pennies on the table. In each trial, the players must place their coins simultaneously with either heads (H) facing up or tails (T) facing up. Let a two-letter string denote the outcome. If the outcome is HH or TT (the players choose the same), then Player 1 pays Player 2 one Peso; if the outcome is HT or TH, then Player 2 pays Player 1 one Peso. What happens if Player 1 uses a deterministic strategy? If Player 2 can determine the strategy, then he can choose his strategy so that he always wins the game. However, if Player 1 chooses the best randomized strategy, then he can expect at best to break even on average. What randomized strategy achieves this?
A generalization of this to three actions is the famous game of Rock-Paper-Scissors [958]. If you want to design a computer program that repeatedly plays this game against smart opponents, it seems best to incorporate randomization.
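As a rough illustration of this effect, the simulation below pits Player 1 against a simple hypothetical adversary that always predicts Player 1's most frequent past action. A deterministic Player 1 loses nearly every trial, while the randomized strategy P(H) = P(T) = 1/2 breaks even on average; the opponent model is an invented example, not a claim about optimal play.

```python
# Illustrative simulation of Example 9.7.  The adversary model is invented:
# Player 2 simply predicts Player 1's most frequent past action and matches it.
import random
from collections import Counter

def average_payoff(player1_choice, trials=10000):
    """Average Pesos won per trial by Player 1 against the predicting opponent."""
    history = Counter()
    payoff = 0
    for _ in range(trials):
        u1 = player1_choice()
        u2 = history.most_common(1)[0][0] if history else random.choice("HT")
        payoff += 1 if u1 != u2 else -1   # mismatch: Player 1 wins one Peso
        history[u1] += 1
    return payoff / trials

print(average_payoff(lambda: "H"))                  # about -1: exploited
print(average_payoff(lambda: random.choice("HT")))  # about 0: breaks even
```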