provided P(B) ≠ 0. The notation E [X; B] means E [X 1_B], where 1_B(ω) is 1 if ω ∈ B and 0 otherwise. Another way of writing E [X; B] is

\[ E[X; B] = \sum_{\omega \in B} X(\omega) P(\{\omega\}). \]

(We will use the notation E [X; B] frequently.)
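To make the notation concrete, here is a minimal sketch on a hypothetical sample space (one roll of a fair die, which is our assumption and not part of the text). The names `P`, `X`, and `expect_on` are illustrative only. It computes E [X; B] as a sum over B and compares it with E [X 1_B].

```python
from fractions import Fraction

# Hypothetical finite sample space: one roll of a fair die.
P = {w: Fraction(1, 6) for w in range(1, 7)}

def X(w):
    # Illustrative random variable: the face value itself.
    return w

def expect_on(X, B, P):
    """E[X; B] computed as the sum over w in B of X(w) * P({w})."""
    return sum(X(w) * P[w] for w in B)

B = {2, 4, 6}  # the event "the roll is even"

# E[X; B] via the restricted sum.
e_xb = expect_on(X, B, P)

# E[X 1_B] via the indicator, summing over the whole sample space.
e_x1b = sum(X(w) * (1 if w in B else 0) * P[w] for w in P)
```

Both quantities come out to (2 + 4 + 6)/6 = 2, as the two ways of writing E [X; B] require.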

Note 1. Suppose we have two disjoint sets C and D. Let A_1 = C, A_2 = D, and A_i = ∅ for i ≥ 3. Then the A_i are pairwise disjoint and

\[ P(C \cup D) = P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i) = P(C) + P(D) \tag{2.1} \]

by Definition 2.2(3) and (4). Therefore Definition 2.2(4) holds when there are only two sets instead of infinitely many, and a similar argument shows the same is true when there are an arbitrary (but finite) number of sets.

Now suppose A ⊂ B. Let C = A and D = B − A, where B − A is defined to be B ∩ A^c (this is frequently written B \ A as well). Then C and D are disjoint, and by (2.1)

P(B) = P(C ∪ D) = P(C) + P(D) ≥ P(C) = P(A).

The other equality we mentioned is proved by letting C = A and D = A^c. Then C and D are disjoint, and

1 = P(Ω) = P(C ∪ D) = P(C) + P(D) = P(A) + P(A^c).

Solving for P(A^c), we have

P(A^c) = 1 − P(A).

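Note 2. Let us show the two definitions of expectation are the same (in the discrete case). Starting with the first definition we have

\begin{align*}
E X &= \sum_x x \, P(X = x) \\
    &= \sum_x x \sum_{\{\omega \in \Omega : X(\omega) = x\}} P(\{\omega\}) \\
    &= \sum_x \sum_{\{\omega \in \Omega : X(\omega) = x\}} X(\omega) P(\{\omega\}) \\
    &= \sum_{\omega \in \Omega} X(\omega) P(\{\omega\}),
\end{align*}

and we end up with the second definition.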
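The regrouping in Note 2 can be checked numerically. The sketch below assumes a hypothetical sample space of two fair coin tosses with X the number of heads; neither choice comes from the text. It evaluates both definitions of E X.

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space: two fair coin tosses, each outcome has probability 1/4.
omega = ["".join(t) for t in product("HT", repeat=2)]
P = {w: Fraction(1, 4) for w in omega}

def X(w):
    # Illustrative random variable: number of heads in the outcome.
    return w.count("H")

# First definition: sum over the values x of x * P(X = x).
values = {X(w) for w in omega}
first = sum(x * sum(P[w] for w in omega if X(w) == x) for x in values)

# Second definition: sum over the outcomes w of X(w) * P({w}).
second = sum(X(w) * P[w] for w in omega)
```

Here both sums equal 1, which is the expected number of heads in two fair tosses.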

Note 3. Suppose X can take the values x_1, x_2, . . . and Y can take the values y_1, y_2, . . .. Let A_i = {ω : X(ω) = x_i} and B_j = {ω : Y(ω) = y_j}. Then

\[ X = \sum_i x_i 1_{A_i}, \qquad Y = \sum_j y_j 1_{B_j}, \]

and so

\[ XY = \sum_i \sum_j x_i y_j 1_{A_i} 1_{B_j}. \]

Since 1_{A_i} 1_{B_j} = 1_{A_i ∩ B_j}, it follows that

\[ E[XY] = \sum_i \sum_j x_i y_j P(A_i \cap B_j), \]

assuming the double sum converges. Since X and Y are independent, A_i = (X = x_i) is independent of B_j = (Y = y_j) and so

\begin{align*}
E[XY] &= \sum_i \sum_j x_i y_j P(A_i) P(B_j) \\
      &= \sum_i x_i P(A_i) \Big( \sum_j y_j P(B_j) \Big) \\
      &= \sum_i x_i P(A_i) \, E Y \\
      &= (E X)(E Y).
\end{align*}
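A small numerical illustration of Note 3 follows. It assumes a hypothetical experiment (two independent rolls of a fair die), takes X to be the first roll and Y the square of the second roll, and also shows that the product rule can fail without independence. All names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space: two independent rolls of a fair die.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def E(f):
    """Expectation of f as a sum over outcomes."""
    return sum(f(w) * P[w] for w in omega)

def X(w):
    return w[0]          # depends only on the first roll

def Y(w):
    return w[1] ** 2     # depends only on the second roll, so independent of X

lhs = E(lambda w: X(w) * Y(w))   # E[XY]
rhs = E(X) * E(Y)                # (E X)(E Y)

# Without independence the identity can fail: compare E[X * X] with (E X)^2.
dependent_lhs = E(lambda w: X(w) * X(w))
dependent_rhs = E(X) ** 2
```

In this example `lhs` and `rhs` agree, while `dependent_lhs` (91/6) differs from `dependent_rhs` (49/4).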


3. Conditional expectation.

Suppose we have 200 men and 100 women, 70 of the men are smokers, and 50 of

the women are smokers. If a person is chosen at random, then the conditional probability

that the person is a smoker given that it is a man is 70 divided by 200, or 35%, while the

conditional probability that the person is a smoker given that it is a woman is 50 divided by

100, or 50%. We will want to be able to encompass both facts in a single entity.

The way to do that is to make conditional probability a random variable rather than a number. To reiterate, we will make conditional probabilities random. Let M and W denote the events that the person chosen is a man or a woman, respectively, and let S and S^c denote the events that the person is a smoker or a nonsmoker, respectively. We have

P(S | M ) = .35, P(S | W ) = .50.

We introduce the random variable

(.35)1M + (.50)1W

and use that for our conditional probability. So on the set M its value is .35 and on the

set W its value is .50.
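The random variable above can be written out directly. The following sketch uses the counts from the example; the encoding of an outcome as a (sex, smoking status) pair and the function names are our own assumptions for illustration.

```python
from fractions import Fraction

# Counts from the example: 200 men (70 smokers) and 100 women (50 smokers).
# An outcome is encoded as (sex, status), with "S" = smoker and "N" = nonsmoker.
counts = {("M", "S"): 70, ("M", "N"): 130, ("W", "S"): 50, ("W", "N"): 50}
total = sum(counts.values())
P = {w: Fraction(c, total) for w, c in counts.items()}

def prob(event):
    """Probability of the set of outcomes w for which event(w) is true."""
    return sum(P[w] for w in P if event(w))

def cond_prob_smoker(w):
    """Value at outcome w of the random variable (.35) 1_M + (.50) 1_W."""
    sex = w[0]
    joint = prob(lambda v: v[0] == sex and v[1] == "S")
    return joint / prob(lambda v: v[0] == sex)

value_on_M = cond_prob_smoker(("M", "N"))   # any outcome in M gives the same value
value_on_W = cond_prob_smoker(("W", "S"))   # any outcome in W gives the same value
```

The value depends on the outcome only through whether it lies in M or in W, which is exactly what makes it a random variable rather than a single number.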

We need to give this random variable a name, so what we do is let G be the σ-field consisting of {∅, Ω, M, W} and denote this random variable P(S | G). Thus we are going to talk about the conditional probability of an event given a σ-field.

What is the precise definition?

Definition 3.1. Suppose there exist finitely (or countably) many sets B_1, B_2, . . ., all having positive probability, such that they are pairwise disjoint, Ω is equal to their union, and G is the σ-field one obtains by taking all finite or countable unions of the B_i. Then the conditional probability of A given G is

\[ P(A \mid G) = \sum_i \frac{P(A \cap B_i)}{P(B_i)} 1_{B_i}(\omega). \]

In short, on the set Bi the conditional probability is equal to P(A | Bi ).

Not every σ-field can be so represented, so this definition will need to be extended when we get to continuous models. σ-fields that can be represented as in Definition 3.1 are called finitely (or countably) generated and are said to be generated by the sets B_1, B_2, . . ..

Let's look at another example. Suppose Ω consists of the possible results when we toss a coin three times: HHH, HHT, etc. Let F_3 denote all subsets of Ω. Let F_1 consist of the sets ∅, Ω, {HHH, HHT, HTH, HTT}, and {THH, THT, TTH, TTT}. So F_1 consists of those events that can be determined by knowing the result of the first toss. We want to let F_2 denote those events that can be determined by knowing the first two tosses. This will include the sets ∅, Ω, {HHH, HHT}, {HTH, HTT}, {THH, THT}, {TTH, TTT}. This is not enough to make F_2 a σ-field, so we add to F_2 all sets that can be obtained by taking unions of these sets.

Suppose we tossed the coin independently and suppose that it was fair. Let us calculate P(A | F_1), P(A | F_2), and P(A | F_3) when A is the event {HHH}. First the conditional probability given F_1. Let C_1 = {HHH, HHT, HTH, HTT} and C_2 = {THH, THT, TTH, TTT}. On the set C_1 the conditional probability is P(A ∩ C_1)/P(C_1) = P(HHH)/P(C_1) = (1/8)/(1/2) = 1/4. On the set C_2 the conditional probability is P(A ∩ C_2)/P(C_2) = P(∅)/P(C_2) = 0. Therefore P(A | F_1) = (.25)1_{C_1}. This is plausible: the probability of getting three heads given the first toss is 1/4 if the first toss is a heads and 0 otherwise.

Next let us calculate P(A | F_2). Let D_1 = {HHH, HHT}, D_2 = {HTH, HTT}, D_3 = {THH, THT}, D_4 = {TTH, TTT}. So F_2 is the σ-field consisting of all possible unions of some of the D_i's. P(A | D_1) = P(HHH)/P(D_1) = (1/8)/(1/4) = 1/2. Also, as above, P(A | D_i) = 0 for i = 2, 3, 4. So P(A | F_2) = (.50)1_{D_1}. This is again plausible: the probability of getting three heads given the first two tosses is 1/2 if the first two tosses were heads and 0 otherwise.
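The two calculations above can be reproduced by applying the formula of Definition 3.1 block by block. In the sketch below the helper names `cond_prob` and `blocks_by_prefix` are our own. The case of F_3 is not worked out in the text; under the same formula it comes out as the indicator of A.

```python
from fractions import Fraction
from itertools import product

# Sample space for three tosses of a fair coin; each outcome has probability 1/8.
omega = ["".join(t) for t in product("HT", repeat=3)]
P = {w: Fraction(1, 8) for w in omega}

def prob(event):
    return sum(P[w] for w in event)

def cond_prob(A, partition):
    """P(A | G) as a dict w -> value, where G is generated by the blocks in `partition`."""
    result = {}
    for B in partition:
        value = prob(A & B) / prob(B)   # P(A ∩ B_i) / P(B_i)
        for w in B:
            result[w] = value           # constant on the block B_i
    return result

def blocks_by_prefix(n):
    """Blocks generating F_n: outcomes grouped by their first n tosses."""
    groups = {}
    for w in omega:
        groups.setdefault(w[:n], set()).add(w)
    return list(groups.values())

A = {"HHH"}
p1 = cond_prob(A, blocks_by_prefix(1))   # P(A | F_1)
p2 = cond_prob(A, blocks_by_prefix(2))   # P(A | F_2)
p3 = cond_prob(A, blocks_by_prefix(3))   # P(A | F_3)
```

For instance, `p1["HTT"]` is 1/4 and `p2["HHT"]` is 1/2, matching (.25)1_{C_1} and (.50)1_{D_1}.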

What about conditional expectation? Recall E [X; B_i] = E [X 1_{B_i}] and also that E [1_B] = 1 · P(1_B = 1) + 0 · P(1_B = 0) = P(B). Given a random variable X, we define

\[ E[X \mid G] = \sum_i \frac{E[X; B_i]}{P(B_i)} 1_{B_i}. \]

This is the obvious definition, and it agrees with what we had before: since E [1_A; B_i] = E [1_A 1_{B_i}] = P(A ∩ B_i), taking X = 1_A gives E [1_A | G] = P(A | G), as it should.
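As an illustration of this definition, the sketch below computes E [X | G] on the three-toss example, with G generated by the first toss. Taking X to be the number of heads is our own choice, and `cond_exp` is a hypothetical helper name. It also checks that the indicator of A = {HHH} recovers the conditional probability found earlier.

```python
from fractions import Fraction
from itertools import product

omega = ["".join(t) for t in product("HT", repeat=3)]
P = {w: Fraction(1, 8) for w in omega}

def cond_exp(X, partition):
    """E[X | G] as a dict: on each block B_i the value is E[X; B_i] / P(B_i)."""
    result = {}
    for B in partition:
        value = sum(X(w) * P[w] for w in B) / sum(P[w] for w in B)
        for w in B:
            result[w] = value
    return result

# Blocks generating F_1: first toss heads, first toss tails.
C1 = {w for w in omega if w[0] == "H"}
C2 = {w for w in omega if w[0] == "T"}

def heads(w):
    return w.count("H")             # illustrative X: number of heads

def ind_A(w):
    return 1 if w == "HHH" else 0   # indicator of A = {HHH}

Y = cond_exp(heads, [C1, C2])       # E[X | F_1]
pA = cond_exp(ind_A, [C1, C2])      # E[1_A | F_1], which should equal P(A | F_1)
```

Here Y equals 2 on C_1 and 1 on C_2, and `pA` equals 1/4 on C_1 and 0 on C_2.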

We now turn to some properties of conditional expectation. Some of the following

propositions may seem a bit technical. In fact, they are! However, these properties are

crucial to what follows and there is no choice but to master them.

Proposition 3.2. E [X | G] is G measurable, that is, if Y = E [X | G], then (Y > a) is a

set in G for each real a.

Proof. By the definition,

\[ Y = E[X \mid G] = \sum_i \frac{E[X; B_i]}{P(B_i)} 1_{B_i} = \sum_i b_i 1_{B_i} \]

if we set b_i = E [X; B_i]/P(B_i). The set (Y ≥ a) is a union of some of the B_i, namely, those B_i for which b_i ≥ a. But the union of any collection of the B_i is in G. The same argument with > in place of ≥ handles the set (Y > a).

An example might help. Suppose

Y = 2 · 1B1 + 3 · 1B2 + 6 · 1B3 + 4 · 1B4

and a = 3.5. Then (Y ≥ a) = B3 ∪ B4 , which is in G.
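The example can be mirrored in a few lines. The sketch represents Y by its value on each block, with the block labels used as plain strings (an encoding we assume for illustration), and reads off which blocks make up (Y ≥ a).

```python
# Y = 2·1_{B1} + 3·1_{B2} + 6·1_{B3} + 4·1_{B4}, stored as block label -> value.
b = {"B1": 2, "B2": 3, "B3": 6, "B4": 4}

def level_set(b, a):
    """Labels of the blocks whose union is the set (Y >= a)."""
    return {name for name, value in b.items() if value >= a}

blocks_at_3_5 = level_set(b, 3.5)
```

With a = 3.5 this selects B3 and B4, so (Y ≥ a) is a union of blocks and hence lies in G.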


Proposition 3.3. If C ∈ G and Y = E [X | G], then E [Y ; C] = E [X; C].

Proof. Since Y = Σ_i (E [X; B_i]/P(B_i)) 1_{B_i} and the B_i are disjoint, then

\[ E[Y; B_j] = \frac{E[X; B_j]}{P(B_j)} E[1_{B_j}] = E[X; B_j]. \]

Now if C = B_{j_1} ∪ · · · ∪ B_{j_n} ∪ · · ·, summing the above over the j_k gives E [Y ; C] = E [X; C].

Let us look at the above example for this proposition, and let us do the case where C = B_2. Note 1_{B_2} 1_{B_2} = 1_{B_2} because the product is 1 · 1 = 1 if ω is in B_2 and 0 otherwise. On the other hand, it is not possible for an ω to be in more than one of the B_i, so 1_{B_2} 1_{B_i} = 0 if i ≠ 2. Multiplying Y in the above example by 1_{B_2}, we see that

E [Y ; C] = E [Y ; B2 ] = E [Y 1B2 ] = E [3 · 1B2 ]

= 3E [1B2 ] = 3P(B2 ).

However the number 3 is not just any number; it is E [X; B_2]/P(B_2). So

\[ 3 P(B_2) = \frac{E[X; B_2]}{P(B_2)} P(B_2) = E[X; B_2] = E[X; C], \]

just as we wanted. If C = B_2 ∪ B_4, for example, we then write

E [X; C] = E [X 1_C] = E [X(1_{B_2} + 1_{B_4})]
= E [X 1_{B_2}] + E [X 1_{B_4}] = E [X; B_2] + E [X; B_4].

By the first part, this equals E [Y ; B_2] + E [Y ; B_4], and we undo the above string of equalities but with Y instead of X to see that this is E [Y ; C].
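Proposition 3.3 can be checked exhaustively on a small example. The sketch below assumes the three-toss sample space with G generated by the first two tosses and X the number of heads (our choices for illustration), and compares E [Y ; C] with E [X; C] for every union C of blocks.

```python
from fractions import Fraction
from itertools import combinations, product

omega = ["".join(t) for t in product("HT", repeat=3)]
P = {w: Fraction(1, 8) for w in omega}
X = {w: w.count("H") for w in omega}   # illustrative X: number of heads

# Blocks generating G: outcomes grouped by the first two tosses.
blocks = [{w for w in omega if w[:2] == p} for p in ("HH", "HT", "TH", "TT")]

# Y = E[X | G], constant on each block.
Y = {}
for B in blocks:
    value = sum(X[w] * P[w] for w in B) / sum(P[w] for w in B)
    for w in B:
        Y[w] = value

def expect_on(Z, C):
    """E[Z; C] as a sum over the outcomes in C."""
    return sum(Z[w] * P[w] for w in C)

# Every set in G is a union of some of the blocks (including the empty union).
checks = []
for r in range(len(blocks) + 1):
    for chosen in combinations(blocks, r):
        C = set().union(*chosen)
        checks.append(expect_on(Y, C) == expect_on(X, C))
```

All 16 comparisons hold in this example, one for each set in G.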

If a random variable Y is G measurable, then for any a we have (Y = a) ∈ G, which means that (Y = a) is either empty or the union of one or more of the B_i. Since the B_i are disjoint, it follows that Y must be constant on each B_i.

Again let us look at an example. Suppose Z takes only the values 1, 3, 4, 7. Let

D1 = (Z = 1), D2 = (Z = 3), D3 = (Z = 4), D4 = (Z = 7). Note that we can write

Z = 1 · 1D1 + 3 · 1D2 + 4 · 1D3 + 7 · 1D4 .

To see this, if ω ∈ D2 , for example, the right hand side will be 0 + 3 · 1 + 0 + 0, which agrees

with Z(ω). Now if Z is G measurable, then (Z ≥ a) ∈ G for each a. Take a = 7, and we

see D4 ∈ G. Take a = 4 and we see D3 ∪ D4 ∈ G. Taking a = 3 shows D2 ∪ D3 ∪ D4 ∈ G.

Now D_3 = (D_3 ∪ D_4) ∩ D_4^c, so since G is a σ-field, D_3 ∈ G. Similarly D_2, D_1 ∈ G. Because sets in G are unions of the B_i's, we must have Z constant on the B_i's. For example, if it so happened that D_1 = B_1, D_2 = B_2 ∪ B_4, D_3 = B_3 ∪ B_6 ∪ B_7, and D_4 = B_5, then

Z = 1 · 1_{B_1} + 3 · 1_{B_2} + 4 · 1_{B_3} + 3 · 1_{B_4} + 7 · 1_{B_5} + 4 · 1_{B_6} + 4 · 1_{B_7}.

We still restrict ourselves to the discrete case. In this context, the properties given

in Propositions 3.2 and 3.3 uniquely determine E [X | G].

Proposition 3.4. Suppose Z is G measurable and E [Z; C] = E [X; C] whenever C ∈ G.

Then Z = E [X | G].

Proof. Since Z is G measurable, then Z must be constant on each B_i. Let the value of Z on B_i be z_i. So Z = Σ_i z_i 1_{B_i}. Then

z_i P(B_i) = E [Z; B_i] = E [X; B_i],

or z_i = E [X; B_i]/P(B_i) as required.

The following propositions contain the main facts about this new definition of conditional expectation that we will need.

Proposition 3.5. (1) If X1 ≥ X2 , then E [X1 | G] ≥ E [X2 | G].

(2) E [aX1 + bX2 | G] = aE [X1 | G] + bE [X2 | G].

(3) If X is G measurable, then E [X | G] = X.

(4) E [E [X | G]] = E X.

(5) If X is independent of G, then E [X | G] = E X.

We will prove Proposition 3.5 in Note 1 at the end of the section. At this point it

is more fruitful to understand what the proposition says.

We will see in Proposition 3.8 below that we may think of E [X | G] as the best

prediction of X given G. Accepting this for the moment, we can give an interpretation of

(1)-(5). (1) says that if X1 is larger than X2 , then the predicted value of X1 should be

larger than the predicted value of X2 . (2) says that the predicted value of X1 + X2 should

be the sum of the predicted values. (3) says that if we know G and X is G measurable,

then we know X and our best prediction of X is X itself. (4) says that the average of the

predicted value of X should be the average value of X. (5) says that if knowing G gives us

no additional information on X, then the best prediction for the value of X is just E X.
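As a sanity check rather than a proof, properties (1)-(5) can be verified on the three-toss example with G = F_1. The random variables below (`X1`, `X2`, `first`, `third`) and the constants a = 2, b = −3 are our own illustrative choices.

```python
from fractions import Fraction
from itertools import product

omega = ["".join(t) for t in product("HT", repeat=3)]
P = {w: Fraction(1, 8) for w in omega}
blocks = [{w for w in omega if w[0] == c} for c in "HT"]   # blocks generating F_1

def E(X):
    return sum(X[w] * P[w] for w in omega)

def cond_exp(X):
    """E[X | F_1] as a dict, constant on each block."""
    out = {}
    for B in blocks:
        value = sum(X[w] * P[w] for w in B) / sum(P[w] for w in B)
        out.update({w: value for w in B})
    return out

X1 = {w: w.count("H") for w in omega}          # heads in all three tosses
X2 = {w: w[:2].count("H") for w in omega}      # heads in the first two tosses, so X1 >= X2
first = {w: int(w[0] == "H") for w in omega}   # determined by the first toss: F_1 measurable
third = {w: int(w[2] == "H") for w in omega}   # third toss only: independent of F_1

c1, c2 = cond_exp(X1), cond_exp(X2)
a, b = 2, -3
combo = cond_exp({w: a * X1[w] + b * X2[w] for w in omega})

prop1 = all(c1[w] >= c2[w] for w in omega)                        # (1) monotonicity
prop2 = all(combo[w] == a * c1[w] + b * c2[w] for w in omega)     # (2) linearity
prop3 = cond_exp(first) == first                                  # (3) measurable X is unchanged
prop4 = E(c1) == E(X1)                                            # (4) E[E[X | G]] = E X
prop5 = all(v == E(third) for v in cond_exp(third).values())      # (5) independence gives E X
```

Each of the five flags evaluates to True in this example.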

Proposition 3.6. If Z is G measurable, then E [XZ | G] = ZE [X | G].

We again defer the proof, this time to Note 2.

Proposition 3.6 says that as far as conditional expectations with respect to a σ-field G go, G-measurable random variables act like constants: they can be taken inside or outside the conditional expectation at will.
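A quick numerical check of Proposition 3.6, again on the three-toss example with G = F_1, is sketched below. The particular Z (equal to 5 when the first toss is heads and −2 otherwise) and X (the number of heads) are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import product

omega = ["".join(t) for t in product("HT", repeat=3)]
P = {w: Fraction(1, 8) for w in omega}
blocks = [{w for w in omega if w[0] == c} for c in "HT"]   # blocks generating F_1

def cond_exp(X):
    """E[X | F_1] as a dict, constant on each block."""
    out = {}
    for B in blocks:
        value = sum(X[w] * P[w] for w in B) / sum(P[w] for w in B)
        out.update({w: value for w in B})
    return out

X = {w: w.count("H") for w in omega}                # number of heads
Z = {w: 5 if w[0] == "H" else -2 for w in omega}    # constant on each block, so F_1 measurable

lhs = cond_exp({w: X[w] * Z[w] for w in omega})     # E[XZ | G]
cx = cond_exp(X)
rhs = {w: Z[w] * cx[w] for w in omega}              # Z E[X | G]
```

The two dictionaries agree outcome by outcome: on the heads block both equal 5 · 2 = 10, and on the tails block both equal −2 · 1 = −2.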


Proposition 3.7. If H ⊂ G ⊂ F, then