9.2 A Game Against Nature

9.2.1 Modeling Nature

For the first time in this book, uncertainty will be directly modeled. There are two DMs:

Robot: This is the name given to the primary DM throughout the book. So far, there has been only one DM. Now that there are two, the name is more important because it will be used to distinguish the DMs from each other.

Nature: This DM is a mysterious force that is unpredictable to the robot. It has its own set of actions, and it can choose them in a way that interferes with the achievements of the robot. Nature can be considered as a synthetic DM that is constructed for the purposes of modeling uncertainty in the decision-making or planning process.

Imagine that the robot and nature each make a decision. Each has a set of actions to choose from. Suppose that the cost depends on which actions are chosen by each. The cost still represents the effect of the outcome on the robot; however, the robot must now take into account the influence of nature on the cost. Since nature is unpredictable, the robot must formulate a model of its behavior. Assume that the robot has a set, U, of actions, as before. It is now assumed that nature also has a set of actions. This is referred to as the nature action space and is denoted by Θ. A nature action is denoted as θ ∈ Θ. It now seems appropriate to call U the robot action space; however, for convenience, it will often be referred to as the action space, in which the robot is implied.

This leads to the following formulation, which extends Formulation 9.1.

Formulation 9.3 (A Game Against Nature)

        1. A nonempty set U called the (robot) action space. Each u ∈ U is referred to as an action.

        2. A nonempty set Θ called the nature action space. Each θ ∈ Θ is referred to as a nature action.

        3. A function L : U × Θ → R ∪ {∞}, called the cost function.

The cost function, L, now depends on u ∈ U and θ ∈ Θ. If U and Θ are finite, then it is convenient to specify L as a |U| × |Θ| matrix called the cost matrix.

Example 9.8 (A Simple Game Against Nature) Suppose that U and Θ each contain three actions. This results in nine possible outcomes, which can be specified by the following cost matrix:

                Θ
           1   −1    0
    U     −1    2   −2
           2   −1    1
The robot action, u ∈ U, selects a row, and the nature action, θ ∈ Θ, selects a column. The resulting cost, L(u, θ), is given by the corresponding matrix entry.

In Formulation 9.3, it appears that both DMs act at the same time; nature does not know the robot action before deciding. In many contexts, nature may know the robot action. In this case, a different nature action space can be defined for every u ∈ U. This generalizes Formulation 9.3 to obtain:

Formulation 9.4 (Nature Knows the Robot Action)

  1. A nonempty set U called the action space. Each u ∈ U is referred to as an action.

  2. For each u ∈ U, a nonempty set Θ(u) called the nature action space.
  3. A function L : U × Θ → R ∪ {∞}, called the cost function.

If the robot chooses an action u ∈ U, then nature chooses from Θ(u).

9.2.2 Nondeterministic vs. Probabilistic Models

What is the best decision for the robot, given that it is engaged in a game against nature? This depends on what information the robot has regarding how nature chooses its actions. It will always be assumed that the robot does not know the precise nature action to be chosen; otherwise, it is pointless to define nature. Two alternative models that the robot can use for nature will be considered. From the robot’s perspective, the possible models are

Nondeterministic: I have no idea what nature will do.

Probabilistic: I have been observing nature and gathering statistics.

Under both models, it is assumed that the robot knows Θ in Formulation 9.3 or Θ(u) for all u ∈ U in Formulation 9.4. The nondeterministic and probabilistic terminology are borrowed from Erdmann [313]. In some literature, the term possibilistic is used instead of nondeterministic. This is an excellent term, but it is unfortunately too similar to probabilistic in English.

Assume first that Formulation 9.3 is used and that U and Θ are finite. Under the nondeterministic model, there is no additional information. One reasonable approach in this case is to make a decision by assuming the worst. It can even be imagined that nature knows what action the robot will take, and it will spitefully choose a nature action that drives the cost as high as possible. This pessimistic view is sometimes humorously referred to as Murphy’s Law (“If anything can go wrong, it will.”) [111] or Sod’s Law. In this case, the best action, u∗ ∈ U, is selected as

u∗ = argmin_{u ∈ U} { max_{θ ∈ Θ} L(u, θ) }.        (9.14)

The action u∗ is the lowest cost choice using worst-case analysis. This is sometimes referred to as a minimax solution because of the min and max in (9.14). If U or Θ is infinite, then the min or max may not exist and should be replaced by inf or sup, respectively.

Worst-case analysis may seem too pessimistic in some applications. Perhaps the assumption that all actions in Θ are equally likely may be preferable. This can be handled as a special case of the probabilistic model, which is described next.

Under the probabilistic model, it is assumed that the robot has gathered enough data to reliably estimate P(θ) (or p(θ) if Θ is continuous). In this case, it is imagined that nature applies a randomized strategy, as defined in Section 9.1.3. It is assumed that the applied nature actions have been observed over many trials, and in the future they will continue to be chosen in the same manner, as predicted by the distribution P(θ). Instead of worst-case analysis, expected-case analysis is used. This optimizes the average cost to be received over numerous independent trials. In this case, the best action, u∗ ∈ U, is

u∗ = argmin_{u ∈ U} E_θ[L(u, θ)],        (9.15)

in which E_θ indicates that the expectation is taken according to the probability distribution (or density) over θ. Since Θ and P(θ) together form a probability space, L(u, θ) can be considered as a random variable for each value of u (it assigns a real value to each element of the sample space).³ Using P(θ), the expectation in (9.15) can be expressed as

E_θ[L(u, θ)] = Σ_{θ ∈ Θ} L(u, θ) P(θ).        (9.16)

Example 9.9 (Nondeterministic vs. Probabilistic) Return to Example 9.8. Let U = {u1, u2, u3} represent the robot actions, and let Θ = {θ1, θ2, θ3} represent the nature actions.

Under the nondeterministic model of nature, u∗ = u1, which results in L(u∗, θ) = 1 in the worst case using (9.14). Under the probabilistic model, let P(θ1) = 1/5, P(θ2) = 1/5, and P(θ3) = 3/5. To find the optimal action, (9.15) can be used. This involves computing the expected cost for each action:

E_θ[L(u1, θ)] = (1)1/5 + (−1)1/5 + (0)3/5 = 0
E_θ[L(u2, θ)] = (−1)1/5 + (2)1/5 + (−2)3/5 = −1        (9.17)
E_θ[L(u3, θ)] = (2)1/5 + (−1)1/5 + (1)3/5 = 4/5.

The best action is u∗ = u2, which produces the lowest expected cost, −1.

If the probability distribution had instead been P = [1/10 4/5 1/10], then u∗ = u1 would have been obtained. Hence the best decision depends on P(θ); if this information is statistically valid, then it enables more informed decisions to be made. If such information is not available, then the nondeterministic model may be more suitable.

³Alternatively, a random variable may be defined over U × Θ, and conditional expectation would be taken, in which u is given.

It is possible, however, to assign P(θ) as a uniform distribution in the absence of data. This means that all nature actions are equally likely; however, conclusions based on this are dangerous; see Section 9.5.
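The expected costs in (9.17) are easy to check numerically. The following sketch reproduces Example 9.9 under the probabilistic model; the matrix and distribution are exactly those given above.

```python
# Verifying the expected costs (9.17) of Example 9.9; a minimal sketch.
L = [[1, -1, 0], [-1, 2, -2], [2, -1, 1]]
P = [1/5, 1/5, 3/5]                       # P(theta1), P(theta2), P(theta3)

def expected_costs(L, P):
    """E_theta[L(u, theta)] for each action u (one entry per row)."""
    return [sum(l * p for l, p in zip(row, P)) for row in L]

exp = expected_costs(L, P)
print(exp)                                # roughly [0.0, -1.0, 0.8]
print(min(range(len(L)), key=lambda i: exp[i]))   # 1 -> u2 is optimal
```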

In Formulation 9.4, the nature action space Θ(u) depends on u ∈ U, the robot action. Under the nondeterministic model, (9.14) simply becomes

u∗ = argmin_{u ∈ U} { max_{θ ∈ Θ(u)} L(u, θ) }.        (9.18)

Unfortunately, these problems do not have a nice matrix representation because the size of Θ(u) can vary for different u ∈ U. In the probabilistic case, P(θ) is replaced by a conditional probability distribution P(θ|u). Estimating this distribution requires observing numerous independent trials for each possible u ∈ U. The behavior of nature can now depend on the robot action; however, nature is still characterized by a randomized strategy. It does not adapt its strategy across multiple trials. The expectation in (9.16) now becomes

E_θ[L(u, θ)] = Σ_{θ ∈ Θ(u)} L(u, θ) P(θ|u),        (9.19)

which replaces P(θ) by P(θ|u).
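When Θ(u) varies with u, a dictionary keyed by action is a natural representation. The sketch below evaluates (9.18) and (9.19) on made-up sets and numbers, purely for illustration.

```python
# A sketch of (9.18) and (9.19) when nature's choices depend on the
# robot action. All sets, costs, and probabilities here are invented.
L = {('u1', 'a'): 1, ('u1', 'b'): 3,
     ('u2', 'c'): 0, ('u2', 'd'): 5, ('u2', 'e'): 2}
Theta = {'u1': ['a', 'b'], 'u2': ['c', 'd', 'e']}               # Theta(u)
P = {'u1': {'a': 0.5, 'b': 0.5},
     'u2': {'c': 0.7, 'd': 0.1, 'e': 0.2}}                      # P(theta|u)

worst = {u: max(L[u, th] for th in Theta[u]) for u in Theta}    # (9.18)
expect = {u: sum(L[u, th] * P[u][th] for th in Theta[u]) for u in Theta}  # (9.19)
print(min(worst, key=worst.get))    # 'u1' (worst-case costs: u1 -> 3, u2 -> 5)
print(min(expect, key=expect.get))  # 'u2' (expected costs: u1 -> 2.0, u2 -> 0.9)
```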

Regret It is important to note that the models presented here are not the only accepted ways to make good decisions. In game theory, the key idea is to minimize “regret.” This is the feeling you get after making a bad decision and wishing that you could change it after the game is finished. Suppose that after you choose some u ∈ U, you are told which θ ∈ Θ was applied by nature. The regret is the amount of cost that you could have saved by picking a different action, given the nature action that was applied.

For each combination of u ∈ U and θ ∈ Θ, the regret, T, is defined as

T(u, θ) = max_{u′ ∈ U} { L(u, θ) − L(u′, θ) }.        (9.20)

For Formulation 9.3, if U and Θ are finite, then a |U| × |Θ| regret matrix can be defined.

Suppose that minimizing regret is the primary concern, as opposed to the actual cost received. Under the nondeterministic model, the action that minimizes the worst-case regret is

u∗ = argmin_{u ∈ U} { max_{θ ∈ Θ} T(u, θ) }.        (9.21)

In the probabilistic model, the action that minimizes the expected regret is

u∗ = argmin_{u ∈ U} E_θ[T(u, θ)].        (9.22)

The only difference with respect to (9.14) and (9.15) is that L has been replaced by T . In Section 9.3.2, regret will be discussed in more detail because it forms the basis of optimality concepts in game theory.

Example 9.10 (Regret Matrix) The regret matrix for Example 9.8 is

                Θ
           2    0    2
    U      0    3    0
           3    0    3

Using the nondeterministic model, u∗ = u1, which results in a worst-case regret of 2 using (9.21). Under the probabilistic model, let P(θ1) = P(θ2) = P(θ3) = 1/3. In this case, u∗ = u2, which yields the optimal expected regret, calculated as 1 using (9.22).
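The regret matrix follows mechanically from (9.20): subtract each column's minimum cost from every entry in that column. The sketch below derives the matrix of Example 9.10 from the cost matrix of Example 9.8 and checks both decision rules.

```python
# Building the regret matrix (9.20) from Example 9.8 and reproducing
# the choices of Example 9.10; a minimal sketch.
L = [[1, -1, 0], [-1, 2, -2], [2, -1, 1]]
m, n = len(L), len(L[0])

col_min = [min(L[i][j] for i in range(m)) for j in range(n)]
T = [[L[i][j] - col_min[j] for j in range(n)] for i in range(m)]
print(T)                                  # [[2, 0, 2], [0, 3, 0], [3, 0, 3]]

worst = [max(row) for row in T]           # (9.21): worst-case regrets [2, 3, 3]
print(min(range(m), key=lambda i: worst[i]))   # 0 -> u1, worst-case regret 2

P = [1/3, 1/3, 1/3]
exp = [sum(t * p for t, p in zip(row, P)) for row in T]  # (9.22)
print(min(range(m), key=lambda i: exp[i]))     # 1 -> u2, expected regret 1
```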

9.2.3 Making Use of Observations

Formulations 9.3 and 9.4 do not allow the robot to receive any information (other than L) prior to making its decision. Now suppose that the robot has a sensor that it can check just prior to choosing the best action. This sensor provides an observation or measurement that contains information about which nature action might be chosen. In some contexts, the nature action can be imagined as a kind of state that has already been selected. The observation then provides information about this. For example, nature might select the current temperature in Bangkok. An observation could correspond to a thermometer in Bangkok that takes a reading.

Formulating the problem Let Y denote the observation space, which is the set of all possible observations, y ∈ Y. For convenience, suppose that Y, U, and Θ are all discrete. It will be assumed as part of the model that some constraints on θ are known once y is given. Under the nondeterministic model a set Y(θ) ⊆ Y is specified for every θ ∈ Θ. The set Y(θ) indicates the possible observations, given that the nature action is θ. Under the probabilistic model a conditional probability distribution, P(y|θ), is specified. Examples of sensing models will be given in Section 9.2.4. Many others appear in Sections 11.1.1 and 11.5.1, although they are expressed with respect to a state space X that reduces to Θ in this section. As before, the probabilistic case also requires a prior distribution, P(θ), to be given. This results in the following formulation.

Formulation 9.5 (A Game Against Nature with an Observation)

        1. A finite, nonempty set U called the action space. Each u ∈ U is referred to as an action.

        2. A finite, nonempty set Θ called the nature action space.
        3. A finite, nonempty set Y called the observation space.
        4. A set Y(θ) ⊆ Y or probability distribution P(y|θ) specified for every θ ∈ Θ. This indicates which observations are possible or probable, respectively, if θ is the nature action. In the probabilistic case a prior, P(θ), must also be specified.
        5. A function L : U × Θ → R ∪ {∞}, called the cost function.

Consider solving Formulation 9.5. A strategy is now more complicated than simply specifying an action because we want to completely characterize the behavior of the robot before the observation has been received. This is accomplished by defining a strategy as a function, π : Y → U. For each possible observation, y ∈ Y, the strategy provides an action. We now want to search the space of possible strategies to find the one that makes the best decisions over all possible observations. In this section, Y is actually a special case of an information space, which is the main topic of Chapters 11 and 12. Eventually, a strategy (or plan) will be conditioned on an information state, which generalizes an observation.

Optimal strategies Now consider finding the optimal strategy, denoted by π∗, under the nondeterministic model. The sets Y(θ) for each θ ∈ Θ must be used to determine which nature actions are possible for each observation, y ∈ Y. Let Θ(y) denote this, which is obtained as

Θ(y) = {θ ∈ Θ | y ∈ Y (θ)}. (9.23)

The optimal strategy, π∗, is defined by setting

π∗(y) = argmin_{u ∈ U} { max_{θ ∈ Θ(y)} L(u, θ) },        (9.24)

for each y ∈ Y. Compare this to (9.14), in which the maximum was taken over all Θ. The advantage of having the observation, y, is that the set is restricted to Θ(y) ⊆ Θ.
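A small sketch makes the two steps (9.23) and (9.24) concrete: filter the nature actions consistent with y, then take the worst-case minimum over that restricted set. The sets and costs below are invented for illustration.

```python
# A sketch of (9.23) and (9.24): restrict nature actions to those
# consistent with the observation y, then choose by worst-case cost.
Theta = ['t1', 't2', 't3']
Y_of = {'t1': {'y1', 'y2'}, 't2': {'y2'}, 't3': {'y1'}}   # Y(theta)
L = {('u1', 't1'): 1, ('u1', 't2'): -1, ('u1', 't3'): 0,
     ('u2', 't1'): -1, ('u2', 't2'): 2, ('u2', 't3'): -2}

def strategy(y, U=('u1', 'u2')):
    theta_y = [th for th in Theta if y in Y_of[th]]                # (9.23)
    return min(U, key=lambda u: max(L[u, th] for th in theta_y))   # (9.24)

print(strategy('y1'), strategy('y2'))   # u2 u1
```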

Under the probabilistic model, an operation analogous to (9.23) must be performed. This involves computing P(θ|y) from P(y|θ) to determine the information that y contains regarding θ. Using Bayes’ rule, (9.9), with marginalization on the denominator, the result is

P(θ|y) = P(y|θ) P(θ) / Σ_{θ′ ∈ Θ} P(y|θ′) P(θ′).        (9.25)

To see the connection between the nondeterministic and probabilistic cases, define a probability distribution, P(y|θ), that is nonzero only if y ∈ Y(θ) and use a uniform distribution for P(θ). In this case, (9.25) assigns nonzero probability to precisely the elements of Θ(y) as given in (9.23). Thus, (9.25) is just the probabilistic version of (9.23). The optimal strategy, π∗, is specified for each y ∈ Y as

π∗(y) = argmin_{u ∈ U} E_θ[L(u, θ) | y] = argmin_{u ∈ U} { Σ_{θ ∈ Θ} L(u, θ) P(θ|y) }.        (9.26)

This differs from (9.15) and (9.16) by replacing P(θ) with P(θ|y). For each u, the expectation in (9.26) is called the conditional Bayes’ risk. The optimal strategy, π∗, always selects the action that minimizes this risk. Note that P(θ|y) in (9.26) can be expressed using (9.25), in which the denominator represents P(y) and does not depend on u; therefore, it does not affect the optimization. Due to this, P(y|θ)P(θ) can be used in the place of P(θ|y) in (9.26), and the same π∗ will be obtained. If the spaces are continuous, then probability densities are used in the place of all probability distributions, and the method otherwise remains the same.

Nature acts twice A convenient, alternative formulation can be given by allowing nature to act twice:

  1. First, a nature action, θ ∈ Θ, is chosen but is unknown to the robot.
  2. Following this, a nature observation action is chosen to interfere with the robot’s ability to sense θ.

Let ψ denote a nature observation action, which is chosen from a nature observation action space, Ψ(θ). A sensor mapping, h, can now be defined that yields y = h(θ, ψ) for each θ ∈ Θ and ψ ∈ Ψ(θ). Thus, for each of the two kinds of nature actions, θ ∈ Θ and ψ ∈ Ψ(θ), an observation, y = h(θ, ψ), is given. This yields an alternative way to express Formulation 9.5:

Formulation 9.6 (Nature Interferes with the Observation)

  1. A nonempty, finite set U called the action space.
  2. A nonempty, finite set Θ called the nature action space.
  3. A nonempty, finite set Y called the observation space.
  4. For each θ ∈ Θ, a nonempty set Ψ(θ) called the nature observation action space.

  5. A sensor mapping h : Θ × Ψ → Y.
  6. A function L : U × Θ → R ∪ {∞} called the cost function.

This nicely unifies the nondeterministic and probabilistic models with a single function h. To express a nondeterministic model, it is assumed that any ψ ∈ Ψ(θ) is possible. Using h,

Θ(y) = {θ ∈ Θ | ∃ψ ∈ Ψ(θ) such that y = h(θ, ψ)}. (9.27)

For a probabilistic model, a distribution P(ψ|θ) is specified (often, this may reduce to P(ψ)). Suppose that when the domain of h is restricted to some θ ∈ Θ, it forms an injective mapping from Ψ(θ) to Y. In other words, every nature observation action leads to a unique observation, assuming θ is fixed. Using P(ψ|θ) and h, P(y|θ) is derived as

P(y|θ) = { P(ψ|θ)   for the unique ψ such that y = h(θ, ψ);
         { 0         if no such ψ exists.                         (9.28)

If the injective assumption is lifted, then P(ψ|θ) is replaced by a sum over all ψ for which y = h(θ, ψ). In Formulation 9.6, the only difference between the nondeterministic and probabilistic models is the characterization of ψ, which represents a kind of measurement interference. A strategy still takes the form π : Y → U. A hybrid model is even possible in which one nature action is modeled nondeterministically and the other probabilistically.

Receiving multiple observations Another extension of Formulation 9.5 is to allow multiple observations, y1, y2, . . ., yn, before making a decision. Each yi is assumed to belong to an observation space, Yi. A strategy, π, now depends on all observations:

π : Y1 × Y2 × · · · × Yn → U. (9.29)

Under the nondeterministic model, Yi(θ) is specified for each i and θ ∈ Θ. The set Θ(y) is replaced by

Θ(y1) ∩ Θ(y2) ∩ · · · ∩ Θ(yn)        (9.30)

in (9.24) to obtain the optimal action, π∗(y1, . . . , yn).

Under the probabilistic model, P(yi|θ) is specified instead. It is often assumed that the observations are conditionally independent given θ. This means for any yi, θ, and yj such that i ≠ j, P(yi|θ, yj) = P(yi|θ). The condition P(θ|y) in (9.26) is replaced by P(θ|y1, . . . , yn). Applying Bayes’ rule, and using the conditional independence of the yi’s given θ, yields

P(θ|y1, . . . , yn) = P(y1|θ) P(y2|θ) · · · P(yn|θ) P(θ) / P(y1, . . . , yn).        (9.31)

The denominator can be treated as a constant factor that does not affect the optimization. Therefore, it does not need to be explicitly computed unless the optimal expected cost is needed in addition to the optimal action.

Conditional independence allows a dramatic simplification that avoids the full specification of P(y|θ). Sometimes the conditional independence assumption is used when it is incorrect, just to exploit this simplification. Therefore, a method that uses conditional independence of observations is often called naive Bayes.

9.2.4 Examples of Optimal Decision Making

The framework presented so far characterizes statistical decision theory, which covers a broad range of applications and research issues. Virtually any context in which a decision must be made automatically, by a machine or a person following specified rules, is a candidate for using these concepts. In Chapters 10 through 12, this decision problem will be repeatedly embedded into complicated planning problems. Planning will be viewed as a sequential decision-making process that iteratively modifies states in a state space. Most often, each decision step will be simpler than what usually arises in common applications of decision theory. This is because planning problems are complicated by many other factors. If the decision step in a particular application is already too hard to solve, then an extension to planning appears hopeless.

It is nevertheless important to recognize the challenges in general that arise when modeling and solving decision problems under the framework of this section. Some examples are presented here to help illustrate its enormous power and scope.

Pattern classification

An active field over the past several decades in computer vision and machine learning has been pattern classification [271, 295, 711]. The general problem involves using a set of data to perform classifications. For example, in computer vision, the data correspond to information extracted from an image. These indicate observed features of an object that are used by a vision system to try to classify the object (e.g., “I am looking at a bowl of Vietnamese noodle soup”).

The presentation here represents a highly idealized version of pattern classification. We will assume that all of the appropriate model details, including the required probability distributions, are available. In some contexts, these can be obtained by gathering statistics over large data sets. In many applications, however, obtaining such data is expensive or inaccessible, and classification techniques must be developed in lieu of good information. Some problems are even unsupervised, which means that the set of possible classes must also be discovered automatically. Due to issues such as these, pattern classification remains a challenging research field.

The general model is that nature first determines the class, then observations are obtained regarding the class, and finally the robot action attempts to guess the correct class based on the observations. The problem fits under Formulation 9.5. Let Θ denote a finite set of classes. Since the robot must guess the class, U = Θ. A simple cost function is defined to measure the mismatch between u and θ:

L(u, θ) = { 0   if u = θ (correct classification);
          { 1   if u ≠ θ (incorrect classification).        (9.32)

The nondeterministic model yields a cost of 1 if it is possible that a classification error can be made using action u. Under the probabilistic model, the expectation of (9.32) gives the probability that a classification error will be made given an action u.

The next part of the formulation considers information that is used to make the classification decision. Let Y denote a feature space, in which each y ∈ Y is called a feature or feature vector (often y ∈ R^n). The feature in this context is just an observation, as given in Formulation 9.5. The best classifier or classification rule is a strategy π : Y → U that provides the smallest classification error in the worst case or expected case, depending on the model.

A Bayesian classifier The probabilistic approach is most common in pattern classification. This results in a Bayesian classifier. Here it is assumed that P(y|θ) and P(θ) are given. The distribution of features for a given class is indicated by P(y|θ). The overall frequency of class occurrences is given by P(θ). If large, pre-classified data sets are available, then these distributions can be reliably learned. The feature space is often continuous, which results in a density p(y|θ), even though P(θ) remains a discrete probability distribution. An optimal classifier, π∗, is designed according to (9.26). It performs classification by receiving a feature vector, y, and then declaring that the class is u = π∗(y). The expected cost using (9.32) is the probability of error.

Example 9.11 (Optical Character Recognition) An example of classification is given by a simplified optical character recognition (OCR) problem. Suppose that a camera creates a digital image of a page of text. Segmentation is first performed to determine the location of each letter. Following this, the individual letters must be classified correctly. Let Θ = {A, B, C, D, E, F, G, H}, which would ordinarily include all of the letters of the alphabet.

Suppose that there are three different image processing algorithms:

Shape extractor: This returns s = 0 if the letter is composed of straight edges only, and s = 1 if it contains at least one curve.

End counter: This returns e, the number of segment ends. For example, O has none and X has four.

Hole counter: This returns h, the number of holes enclosed by the character. For example, X has none and O has one.

The feature vector is y = (s, e, h). The values that should be reported under ideal conditions are shown in Figure 9.1. These indicate Θ(s), Θ(e), and Θ(h). The intersection of these yields Θ(y) for any combination of s, e, and h.

Imagine doing classification under the nondeterministic model, with the as- sumption that the features always provide correct information. For y = (0, 2, 1), the only possible letter is A. For y = (1, 0, 2), the only letter is B. If each (s, e, h) is consistent with only one or no letters, then a perfect classifier can be constructed. Unfortunately, (0, 3, 0) is consistent with both E and F. In the worst case, the cost of using (9.32) is 1.

Shape   0:  A, E, F, H
        1:  B, C, D, G

Ends    0:  B, D
        1:  (none)
        2:  A, C, G
        3:  E, F
        4:  H

Holes   0:  C, E, F, G, H
        1:  A, D
        2:  B

Figure 9.1: A mapping from letters to feature values.
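The nondeterministic classifier described above is just a set intersection over the rows of Figure 9.1. The following sketch encodes the figure directly and reproduces the cases discussed in the example.

```python
# A sketch of the nondeterministic classifier for Example 9.11: intersect
# the letter sets from Figure 9.1 consistent with each observed feature.
shape = {0: set('AEFH'), 1: set('BCDG')}
ends  = {0: set('BD'), 1: set(), 2: set('ACG'), 3: set('EF'), 4: set('H')}
holes = {0: set('CEFGH'), 1: set('AD'), 2: set('B')}

def candidates(s, e, h):
    """Theta(y) for the feature vector y = (s, e, h)."""
    return shape[s] & ends[e] & holes[h]

print(candidates(0, 2, 1))   # {'A'}
print(candidates(1, 0, 2))   # {'B'}
print(candidates(0, 3, 0))   # {'E', 'F'} -- ambiguous without more features
```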

One way to fix this is to introduce a new feature. Suppose that an image processing algorithm is used to detect corners. These are places at which two segments meet at a right (90-degree) angle. Let c denote the number of corners, and let the new feature vector be y = (s, e, h, c). The new algorithm nicely distinguishes E from F, for which c = 2 and c = 1, respectively. Now all letters can be correctly classified without errors.

Of course, in practice, the image processing algorithms occasionally make mistakes. A Bayesian classifier can be designed to maximize the probability of success. Assume conditional independence of the observations, which means that the classifier can be considered naive. Suppose that the four image processing algorithms are run over a training data set and the results are recorded. In each case, the correct classification is determined by hand to obtain probabilities P(s|θ), P(e|θ), P(h|θ), and P(c|θ). For example, suppose that the hole counter receives the letter A as input. After running the algorithm over many occurrences of A in text, it may be determined that P(h = 1 | θ = A) = 0.9, which is the correct answer. With smaller probabilities, perhaps P(h = 0 | θ = A) = 0.09 and P(h = 2 | θ = A) = 0.01. Assuming that the output of each image processing algorithm is independent given the input letter, a joint probability can be assigned as

P(y|θ) = P(s, e, h, c | θ) = P(s|θ) P(e|θ) P(h|θ) P(c|θ).        (9.33)

The value of the prior P(θ) can be obtained by running the classifier over large amounts of hand-classified text and recording the relative numbers of occurrences of each letter. It is interesting to note that some context-specific information can be incorporated. If the text is known to be written in Spanish, then P(θ) should be different from that for text written in English. Tailoring P(θ) to the type of text that will appear improves the performance of the resulting classifier.

The classifier makes its decisions by choosing the action that minimizes the probability of error. This error is proportional to

Σ_{θ ∈ Θ} L(u, θ) P(s|θ) P(e|θ) P(h|θ) P(c|θ) P(θ),        (9.34)

by neglecting the constant P(y) in the denominator of Bayes’ rule in (9.26).
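In code, (9.34) amounts to summing the naive Bayes numerator over every class other than the declared one, since L(u, θ) vanishes when u = θ. A minimal sketch, in which the per-feature models and prior are stand-ins for the quantities that would be learned from pre-classified text:

```python
# A sketch of the score (9.34) for the naive Bayes OCR classifier.
# models[i](feature, theta) plays the role of P(feature_i | theta);
# these, and the prior, would come from training data in practice.
def error_score(u, y, Theta, prior, models):
    """Unnormalized probability of error if class u is declared for
    feature vector y = (s, e, h, c)."""
    total = 0.0
    for theta in Theta:
        if theta == u:           # L(u, theta) = 0 for a correct guess
            continue
        p = prior[theta]
        for feature, model in zip(y, models):
            p *= model(feature, theta)
        total += p
    return total

# The optimal guess minimizes this score over all classes:
# u_star = min(Theta, key=lambda u: error_score(u, y, Theta, prior, models))
```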

Parameter estimation

Another important application of the decision-making framework of this section is parameter estimation [89, 268]. In this case, nature selects a parameter, θ ∈ Θ, and Θ represents a parameter space. Through one or more independent trials, some observations are obtained. Each observation should ideally be a direct measurement of θ, but imperfections in the measurement process distort the observation. Usually, Θ ⊆ Y, and in many cases, Y = Θ. The robot action is to guess the parameter that was chosen by nature. Hence, U = Θ. In most applications, all of the spaces are continuous subsets of R^n. The cost function is designed to increase as the error, ∥u − θ∥, becomes larger.

Example 9.12 (Parameter Estimation) Suppose that U = Y = Θ = R. Nature therefore chooses a real-valued parameter, which is estimated. The cost of making a mistake is

L(u, θ) = (u − θ)².        (9.35)

Suppose that a Bayesian approach is taken. The prior probability density p(θ) is given as uniform over an interval [a, b] ⊂ R. An observation is received, but it is noisy. The noise can be modeled as a second action of nature, as described in Section 9.2.3. This leads to a density p(y|θ). Suppose that the noise is modeled with a Gaussian, which results in

p(y|θ) = (1/√(2πσ²)) e^{−(y−θ)²/(2σ²)},        (9.36)

in which the mean is θ and the standard deviation is σ.

The optimal parameter estimate based on y is obtained by selecting u ∈ R to minimize

∫_{−∞}^{∞} L(u, θ) p(θ|y) dθ,        (9.37)

in which

p(θ|y) = p(y|θ) p(θ) / p(y),        (9.38)

by Bayes’ rule. The term p(y) does not depend on θ, and it can therefore be ignored in the optimization. Using the prior density, p(θ) = 0 outside of [a, b]; hence, the domain of integration can be restricted to [a, b]. The value p(θ) = 1/(b − a) inside [a, b] is also a constant that can be ignored in the optimization. Using (9.36), this means that u is selected to optimize

∫_a^b L(u, θ) p(y|θ) dθ,        (9.39)

which can be expressed in terms of the standard error function, erf(x) (the integral, from 0 to a given value, of a Gaussian density).

If a sequence, y1, . . . , yk, of independent observations is obtained, then (9.39) is replaced by

∫_a^b L(u, θ) p(y1|θ) · · · p(yk|θ) dθ.        (9.40)
