
EE 278 Lecture Notes #3, Winter 2010–2011

Random variables, vectors, and processes

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 1

Random Variables

Probability space (Ω,F , P)

A (real-valued) random variable is a real-valued function defined on Ω with a technical condition (to be stated)

Common to use upper-case letters. E.g., a random variable X is a function X : Ω → R. Y, Z, U, V, Θ, · · · Also common: a random variable may take on values only in some subset ΩX ⊂ R (sometimes called the alphabet of X; AX and X are also common notations)

Intuition: Randomness is in the experiment, which produces outcome ω according to probability P ⇒ random variable outcome is X(ω) ∈ ΩX ⊂ R.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 2

Examples

Consider (Ω,F , P) with Ω = R, P determined by uniform pdf on [0, 1)

Coin flip from earlier: X : R → {0, 1} by

X(r) = 0 if r ≤ 0.5, 1 otherwise.

Observe X, do not observe outcome of fair spin.

Lots of possible random variables, e.g., W(r) = r², Z(r) = e^r, V(r) = r, L(r) = −r ln r (require r ≥ 0), Y(r) = cos(2πr), etc.

Can think of rvs as observations or measurements made on an underlying experiment.
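A minimal Python sketch (not from the notes, illustrative only): the "experiment" draws ω uniformly from [0, 1), and each random variable is just a deterministic function applied to that outcome.

import numpy as np

rng = np.random.default_rng(0)
omega = rng.uniform(0.0, 1.0, size=5)   # outcomes of the underlying experiment

X = (omega > 0.5).astype(int)           # binary quantizer: 0 if omega <= 0.5, else 1
W = omega**2                            # W(omega) = omega^2
Z = np.exp(omega)                       # Z(omega) = e^omega
Y = np.cos(2 * np.pi * omega)           # Y(omega) = cos(2*pi*omega)

for om, xv, wv in zip(omega, X, W):
    print(f"omega={om:.3f}  X(omega)={xv}  W(omega)={wv:.3f}")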

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 3

Functions of random variables

Suppose that X is a rv defined on (Ω, F, P) and suppose that g : ΩX → R is another real-valued function.

Then the function g(X) : Ω → R defined by g(X)(ω) = g(X(ω)) is also a real-valued mapping of Ω, i.e., a real-valued function of a random variable is a random variable

Can express the previous examples as W = V², Z = e^V, L = −V ln V, Y = cos(2πV)

Similarly, 1/W, sinh(Y), L³ are all random variables

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 4


Random vectors and random processes

A finite collection of random variables (defined on a common probability space (Ω, F, P)) is a random vector

E.g., (X, Y), (X0, X1, · · · , Xk−1)

An infinite collection of random variables (defined on a common probability space) is a random process

E.g., {Xn, n = 0, 1, 2, · · ·}, {X(t); t ∈ (−∞, ∞)}

So the theory of random vectors and random processes mostly boils down to the theory of random variables.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 5

Derived distributions

In general: “input” probability space (Ω, F, P) + random variable X ⇒ “output” probability space, say (ΩX, B(ΩX), PX), where ΩX ⊂ R and PX is the distribution of X: PX(F) = Pr(X ∈ F)

Typically PX described by pmf pX or pdf fX

For the binary quantizer special case we derived PX.

Idea generalizes and forces a technical condition on the definition of a random variable (and hence also on random vectors and random processes)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 6

Inverse image formula

Given (Ω,B(Ω), P) and a random variable X, find PX

Basic method: PX(F) = the probability computed using P of all the original sample points that are mapped by X into the subset F:

PX(F) = P({ω : X(ω) ∈ F})

Shorthand way to write the formula in terms of the inverse image of an event F ∈ B(ΩX) under the mapping X : Ω → ΩX: X⁻¹(F) = {r : X(r) ∈ F}:

PX(F) = P(X⁻¹(F))

Written informally as PX(F) = Pr(X ∈ F) = P{X ∈ F} = “probability that random variable X assumes a value in F”

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 7

[Figure: the mapping X sends the inverse image X⁻¹(F) ⊂ Ω into the event F ⊂ ΩX]

Inverse image method: Pr(X ∈ F) = P({ω : X(ω) ∈ F}) = P(X⁻¹(F))

The inverse image formula — fundamental to probability, random processes, signal processing.

Shows how to compute probabilities of output events in terms of the input probability space. But does the definition make sense?

i.e., is PX(F) = P(X⁻¹(F)) well-defined for all output events F?

Yes, if we include the requirement in the definition of a random variable —

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 8


Careful definition of a random variable

Given a probability space (Ω, F, P), a (real-valued) random variable X is a function X : Ω → ΩX ⊂ R with the property that

if F ∈ B(ΩX), then X−1(F) ∈ F

Notes:

• In English: X : Ω → ΩX ⊂ R is a random variable iff the inverse image of every output event is an input event and therefore PX(F) = P(X⁻¹(F)) is well-defined for all events F.

• Another name for a function with this property: measurable function

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 9

• Most every function we encounter is measurable, but the calculus of probability rests on this property and advanced courses prove measurability of important functions.

In the simple binary quantizer example, X is measurable (easy to show since F = B([0, 1)) contains intervals). Recall

PX({0}) = P({r : X(r) = 0}) = P(X⁻¹({0})) = P({r : 0 ≤ r ≤ 0.5}) = P([0, 0.5]) = 0.5

PX({1}) = P(X⁻¹({1})) = P((0.5, 1.0]) = 0.5

PX(ΩX) = PX({0, 1}) = P(X⁻¹({0, 1})) = P([0, 1)) = 1

PX(∅) = P(X⁻¹(∅)) = P(∅) = 0.

In general, find PX by computing the pmf or pdf, as appropriate. Many shortcuts, but the basic approach is the inverse image formula.
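A rough numerical check of the inverse image formula (mine, not part of the notes): estimate PX(F) = P(X⁻¹(F)) by drawing many outcomes of the underlying uniform experiment and applying the quantizer X.

import numpy as np

rng = np.random.default_rng(1)
omega = rng.uniform(0.0, 1.0, size=200_000)
X = (omega > 0.5).astype(int)            # the binary quantizer of the fair spinner

# PX({0}) should be P([0, 0.5]) = 0.5 and PX({1}) = P((0.5, 1)) = 0.5
print("PX(0) ~", np.mean(X == 0))
print("PX(1) ~", np.mean(X == 1))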

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 10

Random vectors

All theory, calculus, applications of individual random variables are useful for studying random vectors and random processes since random vectors and processes are simply collections of random variables.

One k-dimensional random vector = k 1-dimensional random variables defined on a common probability space.

Earlier example: two coin flips, k coin flips (first k binary coefficients of fair spinner)

Several notations used, e.g., Xk = (X0, X1, . . . , Xk−1) is shorthand for Xk(ω) = (X0(ω), X1(ω), . . . , Xk−1(ω))

or X or {Xn; n = 0, 1, . . . , k − 1} or {Xn; n ∈ Zk}

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 11

Can be discrete (described by a multidimensional pmf) or continuous (e.g., described by a multidimensional pdf) or mixed

Recall that a real-valued function of a random variable is a randomvariable.

Similarly, a real-valued function of a random vector (several random variables) is a random variable. E.g., if X0, X1, . . . , Xn−1 are random variables, then

Sn = (1/n) ∑_{k=0}^{n−1} Xk

is a random variable defined by

Sn(ω) = (1/n) ∑_{k=0}^{n−1} Xk(ω)
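Illustrative sketch (mine): Sn is itself a random variable — one number per outcome of the underlying experiment. Here the Xk are simulated fair bits.

import numpy as np

rng = np.random.default_rng(2)
n = 100
num_outcomes = 5
Xk = rng.integers(0, 2, size=(num_outcomes, n))   # rows: outcomes omega, cols: X_0..X_{n-1}
Sn = Xk.mean(axis=1)                              # S_n(omega) = (1/n) * sum_k X_k(omega)
print(Sn)   # each entry is one realization of the random variable S_n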

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 12


Inverse image formula for random vectors

PX(F) = P(X⁻¹(F)) = P({ω : X(ω) ∈ F}) = P({ω : (X0(ω), X1(ω), . . . , Xk−1(ω)) ∈ F})

where the various forms are equivalent and all stand for Pr(X ∈ F)

Technically, the formula holds for suitable events F ∈ B(Rᵏ), the Borel field of Rᵏ (or some suitable subset). See book for discussion.

One multidimensional event of particular interest is a Cartesian product of 1D events (called a rectangle): F = ×_{i=0}^{k−1} Fi = {xᵏ : xi ∈ Fi; i = 0, . . . , k − 1}

PX(F) = P({ω : X0(ω) ∈ F0, X1(ω) ∈ F1, . . . , Xk−1(ω) ∈ Fk−1})

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 13

Random processes

A random vector is a finite collection of rvs defined on a common probability space

A random process is an infinite family of rvs defined on a common probability space. Many types:

Xn; n = 0, 1, 2, . . . (discrete-time, one-sided)

Xn; n ∈ Z (discrete-time, two-sided)

Xt; t ∈ [0,∞) (continuous-time, one-sided)

Xt; t ∈ R (continuous-time, two-sided)

Also called stochastic process

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 14

In general: Xt; t ∈ T or X(t); t ∈ T

Other notations: X(t), X[n] (for discrete-time)

Sloppy but common: X(t), where context tells you it is a random process and not a single rv

Also called a stochastic process. Discrete-time random processes are also called time series

Always: a random process is an indexed family of random variables; T is the index set

For each t, Xt is a random variable. All Xt are defined on a common probability space

The index is usually time; in some applications it is space, e.g., a random field {X(t, s); t, s ∈ [0, 1)} models a random image, {V(x, y, t); x, y ∈ [0, 1); t ∈ [0, ∞)} models analog video.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 15

Keep in mind the suppressed argument ω — e.g., each Xt is Xt(ω), a function defined on the sample space

X(t) is X(t, ω); it can be viewed as a function of two arguments

Have seen one example — fair coin flips, a Bernoulli random process

Another, simpler, example:

Random sinusoids: Suppose that A and Θ are two random variables with a joint pdf fA,Θ(a, θ) = fA(a) fΘ(θ). For example, Θ ∼ U([0, 2π)) and A ∼ N(0, σ²). Define a continuous-time random process X(t) for all t ∈ R

X(t) = A cos(2πt + Θ)

Or, making the dependence on ω explicit,

X(t,ω) = A(ω) cos(2πt + Θ(ω))
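Sketch (mine) of the random sinusoid: each draw of (A, Θ) picks out one sample function t → A cos(2πt + Θ); the value of sigma below is an arbitrary choice.

import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0
A = rng.normal(0.0, sigma)               # A ~ N(0, sigma^2)
Theta = rng.uniform(0.0, 2 * np.pi)      # Theta ~ U([0, 2*pi))

t = np.linspace(0.0, 2.0, 9)
X_t = A * np.cos(2 * np.pi * t + Theta)  # X(t, omega) for this particular omega
print(np.round(X_t, 3))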

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 16


Derived distributions for random variables

General problem: Given a probability space (Ω, F, P) and a random variable X with range space (alphabet) ΩX. Find the distribution PX.

If ΩX is discrete, then PX described by a pmf

pX(x) = P(X⁻¹({x})) = P({ω : X(ω) = x})

PX(F) = ∑_{x∈F} pX(x) = P(X⁻¹(F))

If ΩX is continuous, then need a pdf.

But a pdf is not a probability, so the inverse image formula does not apply immediately ⇒ alter approach

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 17

Cumulative distribution functions

Define the cumulative distribution function (cdf) by

FX(x) ≡ ∫_{−∞}^{x} fX(r) dr = Pr(X ≤ x)

This is a probability and the inverse image formula works

FX(x) = P(X⁻¹((−∞, x]))

and from calculus

fX(x) = (d/dx) FX(x)

So first find cdf FX(x), then differentiate to find fX(x)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 18

Notes:

• If a ≥ b, then since (−∞, a] = (−∞, b] ∪ (b, a] is the union of disjoint intervals, FX(a) = FX(b) + PX((b, a]) and hence

PX((b, a]) = ∫_b^a fX(x) dx = FX(a) − FX(b)

⇒ FX(x) is monotonically nondecreasing

• cdf is well defined for discrete rvs:

FX(r) = Pr(X ≤ r) = ∑_{x: x≤r} pX(x),

but not as useful. Not needed for derived distributions

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 19

If the original space (Ω, F, P) is a discrete probability space, then an rv X defined on (Ω, F, P) is also discrete

Inverse image formula ⇒

pX(x) = PX({x}) = P(X⁻¹({x})) = ∑_{ω: X(ω)=x} p(ω)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 20


Example: discrete derived distribution

Ω = Z+, P determined by the geometric pmf

Define a random variable Y: Y(ω) = 1 if ω even, 0 if ω odd

Using the inverse image formula for the pmf for Y(ω) = 1:

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 21

pY(1) = ∑_{ω even} (1 − p)^{ω−1} p = ∑_{k=2,4,...} (1 − p)^{k−1} p

      = (p/(1 − p)) ∑_{k=1}^{∞} ((1 − p)²)ᵏ = p(1 − p) ∑_{k=0}^{∞} ((1 − p)²)ᵏ

      = p(1 − p) / (1 − (1 − p)²) = (1 − p)/(2 − p)

pY(0) = 1 − pY(1) = 1/(2 − p)
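A quick numerical check (mine, not from the notes) of pY(1) = (1 − p)/(2 − p): sum the geometric pmf over the even outcomes and compare with the closed form.

p = 0.3
pY1_sum = sum((1 - p) ** (k - 1) * p for k in range(2, 2000, 2))   # k = 2, 4, 6, ...
pY1_formula = (1 - p) / (2 - p)
print(pY1_sum, pY1_formula)     # both ~ 0.4118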

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 22

Suppose the original space is (Ω, F, P) = (R, B(R), P) where P is described by a pdf g:

P(F) = ∫_{r∈F} g(r) dr; F ∈ B(R).

X a rv. Inverse image formula ⇒

PX(F) = P(X⁻¹(F)) = ∫_{r: X(r)∈F} g(r) dr.

If X is discrete, find the pmf pX(x) = ∫_{r: X(r)=x} g(r) dr

Quantizer example did this.

If X is continuous, want the pdf. First find cdf then differentiate.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 23

Example: continuous derived distribution

Square of a random variable

(R,B(R), P) with P induced by a Gaussian pdf.

Define W : R → R by W(r) = r²; r ∈ R.

Find the pdf fW. First find the cdf FW, then differentiate. If w < 0, FW(w) = 0. If w ≥ 0,

FW(w) = Pr(W ≤ w) = P({ω : W(ω) = ω² ≤ w})

       = P([−w^{1/2}, w^{1/2}]) = ∫_{−w^{1/2}}^{w^{1/2}} g(r) dr

This can be complicated, but don’t need to plug in g yet

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 24


Use the integral differentiation formula to get the pdf directly —

(d/dw) ∫_{a(w)}^{b(w)} g(r) dr = g(b(w)) (db(w)/dw) − g(a(w)) (da(w)/dw)

In our example

fW(w) = g(w^{1/2}) (w^{−1/2}/2) − g(−w^{1/2}) (−w^{−1/2}/2)

E.g., if g = N(0, σ²), then

fW(w) = (w^{−1/2}/√(2πσ²)) e^{−w/(2σ²)}; w ∈ [0, ∞).

— a chi-squared pdf with one degree of freedom
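Monte Carlo sanity check (mine): W = X² with X ∼ N(0, σ²) should follow the derived pdf fW(w) = w^{−1/2} exp(−w/(2σ²)) / √(2πσ²); compare the empirical probability of an interval with the integral of that pdf.

import numpy as np

rng = np.random.default_rng(4)
sigma = 1.5
x = rng.normal(0.0, sigma, size=500_000)
w = x**2

lo, hi = 0.5, 1.5
grid = np.linspace(lo, hi, 2001)
pdf = grid**-0.5 * np.exp(-grid / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print("empirical:", np.mean((w > lo) & (w < hi)))
print("integral :", np.trapz(pdf, grid))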

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 25

Example: continuous derived distribution

The max and min functions

Let X ∼ fX(x) and Y ∼ fY(y) be independent so that fX,Y(x, y) = fX(x) fY(y).

Define U = max{X, Y}, V = min{X, Y}

where

max(x, y) = x if x ≥ y, y otherwise

min(x, y) = y if x ≥ y, x otherwise

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 26

Find the pdfs of U and V.

To find the pdf of U, we first find its cdf. U ≤ u iff both X and Y are ≤ u, so using independence

FU(u) = Pr(U ≤ u) = Pr(X ≤ u,Y ≤ u) = FX(u)FY(u)

Using the product rule for derivatives,

fU(u) = fX(u)FY(u) + fY(u)FX(u)

To find the pdf of V, first find the cdf. V ≤ v iff either X or Y ≤ v, so that using independence

FV(v) = Pr(X ≤ v or Y ≤ v)

= 1 − Pr(X > v,Y > v)

= 1 − (1 − FX(v))(1 − FY(v))

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 27

Thus fV(v) = fX(v) + fY(v) − fX(v)FY(v) − fY(v)FX(v)
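Simulation check (mine) of the max/min cdf formulas, using two independent exponential rvs with exact cdfs FX(t) = 1 − exp(−t) and FY(t) = 1 − exp(−2t).

import numpy as np

rng = np.random.default_rng(5)
n = 400_000
X = rng.exponential(1.0, n)        # rate 1
Y = rng.exponential(0.5, n)        # rate 2 (numpy takes scale = 1/rate)
U, V = np.maximum(X, Y), np.minimum(X, Y)

u = 0.7
FX, FY = 1 - np.exp(-u), 1 - np.exp(-2 * u)
print("F_U:", np.mean(U <= u), "vs", FX * FY)
print("F_V:", np.mean(V <= u), "vs", 1 - (1 - FX) * (1 - FY))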

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 28


Directly-given random variables

All named examples of pmfs (uniform, Bernoulli, binomial, geometric, Poisson) and pdfs (uniform, exponential, Gaussian, Laplacian, chi-squared, etc.) and the probability spaces they imply can be considered as describing random variables:

Suppose (Ω,F , P) is a probability space with Ω ⊂ R.

Define a random variable V : Ω→ Ω

V(ω) = ω

— the identity mapping; the random variable just reports the original sample value ω

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 29

Implies output probability space in trivial way:

PV(F) = P(V−1(F)) = P(F)

If the original space is discrete (continuous), so is the random variable, and the random variable is described by a pmf (pdf)

A random variable is said to be Bernoulli, binomial, etc. if its distribution is determined by a Bernoulli, binomial, etc. pmf (or pdf)

Two random variables V and X (possibly defined on different experiments) are said to be equivalent or identically distributed if PV = PX, i.e., PV(F) = PX(F) for all events F

E.g., both continuous with same pdf, or both discrete with same pmf

Example: Binary random variable defined as quantization of the fair spinner vs. directly given as above.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 30

Note: Two ways to describe random variables:

1. Describe a probability space (Ω, F, P) and define a function X on it. Together these imply a distribution PX for the rv (by a pmf or pdf)

2. (Directly given) Describe distribution PX directly (by a pmf or pdf).

Implicitly (Ω,F , P) = (ΩX,B(ΩX), PX) and X(ω) = ω.

Both representations are useful.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 31

Derived distributions: random vectors

As in the scalar case, the distribution can be described by probability functions — cdfs and either pmfs or pdfs (or both)

If the random vector has a discrete range space, then the distribution can be described by a multidimensional pmf pX(x) = PX({x}) = Pr(X = x) as

PX(F) = ∑_{x∈F} pX(x) = ∑_{(x0,x1,...,xk−1)∈F} pX0,X1,...,Xk−1(x0, x1, . . . , xk−1)

If the random vector X has a continuous range space, then the distribution can be described by a multidimensional pdf fX: PX(F) = ∫_F fX(x) dx. Use the multidimensional cdf to find the pdf.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 32


Given a k-dimensional random vector X, define the cumulative distribution function (cdf) FX by

FX(x) = FX0,X1,...,Xk−1(x0, x1, . . . , xk−1)
      = PX({α : αi ≤ xi; i = 0, 1, . . . , k − 1}) = Pr(Xi ≤ xi; i = 0, 1, . . . , k − 1)
      = ∫_{−∞}^{x0} ∫_{−∞}^{x1} · · · ∫_{−∞}^{xk−1} fX0,X1,...,Xk−1(α0, α1, . . . , αk−1) dα0 dα1 · · · dαk−1

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 33

Other ways to express the multidimensional cdf:

FX(x) = PX(×_{i=0}^{k−1} (−∞, xi])
      = P({ω : Xi(ω) ≤ xi; i = 0, 1, . . . , k − 1})
      = P(∩_{i=0}^{k−1} Xi⁻¹((−∞, xi])).

Integration and differentiation are inverses of each other ⇒

fX0,X1,...,Xk−1(x0, x1, . . . , xk−1) = ∂ᵏ/(∂x0 ∂x1 . . . ∂xk−1) FX0,X1,...,Xk−1(x0, x1, . . . , xk−1).

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 34

Joint and marginal distributions

Random vector X = (X0, X1, . . . , Xk−1) is a collection of random variables defined on a common probability space (Ω, F, P)

Alternatively, X is a random vector that takes on values randomly as described by a probability distribution PX, without explicit reference to the underlying probability space.

Either the original probability measure P or the induced distribution PX can be used to compute probabilities of events involving the random vector.

E.g., finding the distributions of individual components of the random vector.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 35

For example, if X = (X0, X1, . . . , Xk−1) is discrete, described by a pmf pX, then the distribution PX0 is described by the pmf pX0(x0), which can be computed as

pX0(x0) = P({ω : X0(ω) = x0})
        = P({ω : X0(ω) = x0, Xi(ω) ∈ ΩX; i = 1, 2, . . . , k − 1})
        = ∑_{x1,x2,...,xk−1} pX(x0, x1, x2, . . . , xk−1)

In English, all of these are Pr(X0 = x0)

In general we have for cdfs that

FX0(x0) = P({ω : X0(ω) ≤ x0})
        = P({ω : X0(ω) ≤ x0, Xi(ω) ∈ ΩX; i = 1, 2, . . . , k − 1}) = FX(x0, ∞, ∞, . . . , ∞)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 36


⇒ if the pdfs exist,

fX0(x0) = ∫ fX(x0, x1, x2, . . . , xk−1) dx1 dx2 . . . dxk−1

Can find distributions for any of the components in this way:

pXi(α) = ∑_{x0,...,xi−1,xi+1,...,xk−1} pX0,X1,...,Xk−1(x0, x1, . . . , xi−1, α, xi+1, . . . , xk−1)

or

fXi(α) = ∫ dx0 . . . dxi−1 dxi+1 . . . dxk−1 fX0,...,Xk−1(x0, . . . , xi−1, α, xi+1, . . . , xk−1)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 37

Sum or integrate over all of the dummy variables corresponding to the unwanted random variables in the vector to obtain the pmf or pdf for the random variable Xi

FXi(α) = FX(∞, ∞, . . . , ∞, α, ∞, . . . , ∞), or Pr(Xi ≤ α) = Pr(Xi ≤ α and Xj ≤ ∞, all j ≠ i)

Similarly can find cdfs/pmfs/pdfs for any pairs or triples of random variables in the random vector or any other subvector (at least in theory)

These relations are called consistency relationships — a random vector distribution implies many other distributions, and these must be consistent with each other.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 38

2D random vectors

Ideas are clearest when only 2 rvs: (X,Y) a random vector.

The marginal distribution of X is obtained from the joint distribution of X and Y by leaving Y unconstrained

PX(F) = PX,Y({(x, y) : x ∈ F, y ∈ R}); F ∈ B(R).

Marginal cdf of X is FX(α) = FX,Y(α,∞)

If the range space of the vector (X,Y) is discrete,

pX(x) = ∑_y pX,Y(x, y).

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 39

If the range space of the vector (X, Y) is continuous and the cdf is differentiable so that fX,Y(x, y) exists,

fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy,

with similar expressions for the distribution for rv Y.

Joint distributions imply marginal distributions.

The opposite is not true without additional assumptions, e.g., independence.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 40


Examples of joint and marginal distributions

Example

Suppose rvs X and Y are such that the random vector (X, Y) has a pmf of the form

pX,Y(x, y) = r(x)q(y),

where r and q are both valid pmfs. (pX,Y is a product pmf)

Then

pX(x) = ∑_y pX,Y(x, y) = ∑_y r(x) q(y) = r(x) ∑_y q(y) = r(x).

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 41

Thus in the special case of a product distribution, knowing the marginal pmfs is enough to know the joint distribution. Thus marginal distributions + independence ⇒ the joint distribution.

Pair of fair coins provides an example:

pXY(x, y) = pX(x) pY(y) = 1/4; x, y = 0, 1

pX(x) = pY(y) = 1/2; x = 0, 1

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 42

Example of where marginals not enough

Flip two fair coins connected by a piece of flexible rubber

pXY(x, y):      y = 0    y = 1
   x = 0         0.4      0.1
   x = 1         0.1      0.4

⇒ pX(x) = pY(y) = 1/2, x = 0, 1

Not a product distribution, but same marginals as the product distribution case

Quite different joints can yield the same marginals. Marginals alone do not tell the story.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 43

Another example

A loaded pair of six-sided dice has the property that the sum of the two dice = 7 on every roll.

All 6 possible combinations ((1,6), (2,5), (3,4), (4,3), (5,2), (6,1)) have equal probability.

Suppose the outcome of one die is X, the other is Y

(X, Y) is a random vector taking values in {1, 2, . . . , 6}²

pX,Y(x, y) = 1/6, x + y = 7, (x, y) ∈ {1, 2, . . . , 6}².

Find marginal pmfs

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 44


pX(x) = ∑_y pXY(x, y) = pXY(x, 7 − x) = 1/6, x = 1, 2, . . . , 6

Same as if it were a product distribution. Marginals alone do not imply the joint

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 45

Continuous example

(X, Y) a random vector with a pdf that is constant on the unit disk in the XY plane:

fX,Y(x, y) = C if x² + y² ≤ 1, 0 otherwise

Find marginal pdfs. Is it a product pdf?

Need C:

∫∫_{x²+y²≤1} C dx dy = 1.

Integral = area of a circle multiplied by C ⇒ C = 1/π.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 46

fX(x) = ∫_{−√(1−x²)}^{+√(1−x²)} C dy = 2C √(1 − x²), x² ≤ 1.

Could now also find C by a second integration:

∫_{−1}^{+1} 2C √(1 − x²) dx = πC = 1,

or C = π⁻¹.

Thus fX(x) = 2π⁻¹ √(1 − x²), x² ≤ 1.

By symmetry Y has the same pdf. fX,Y not a product pdf.

Note marginal pdf is not constant, even though the joint pdf is.
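Numerical check (mine) of the marginal fX(x) = (2/π)√(1 − x²): sample uniformly on the unit disk by rejection and compare an interval probability with the integral of the marginal pdf.

import numpy as np

rng = np.random.default_rng(6)
pts = rng.uniform(-1, 1, size=(2_000_000, 2))
pts = pts[pts[:, 0]**2 + pts[:, 1]**2 <= 1]       # keep points inside the disk

a, b = 0.2, 0.4     # estimate Pr(a < X < b) and compare with the integral of f_X
empirical = np.mean((pts[:, 0] > a) & (pts[:, 0] < b))
grid = np.linspace(a, b, 1001)
print(empirical, np.trapz(2 / np.pi * np.sqrt(1 - grid**2), grid))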

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 47

Joints and marginals: Gaussian pair

2D Gaussian pdf with k = 2, m = (0, 0)ᵗ, and Λ = {λ(i, j)}: λ(1, 1) = λ(2, 2) = 1, λ(1, 2) = λ(2, 1) = ρ. The inverse matrix is

[ 1  ρ ; ρ  1 ]⁻¹ = (1/(1 − ρ²)) [ 1  −ρ ; −ρ  1 ],

so the joint pdf for the random vector (X, Y) is

fX,Y(x, y) = exp{ −(x² + y² − 2ρxy)/(2(1 − ρ²)) } / (2π√(1 − ρ²)), (x, y) ∈ R².

ρ called “correlation coefficient”

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 48


Need ρ² < 1 for Λ to be positive definite

To find the pdf of X, integrate joint over y

Do this using standard trick: complete the square:

x² + y² − 2ρxy = (y − ρx)² − ρ²x² + x² = (y − ρx)² + (1 − ρ²)x²

fX,Y(x, y) = exp{ −(y − ρx)²/(2(1 − ρ²)) − x²/2 } / (2π√(1 − ρ²))
           = [ exp{−(y − ρx)²/(2(1 − ρ²))} / √(2π(1 − ρ²)) ] · [ exp{−x²/2} / √(2π) ].

Part of the joint is N(ρx, 1 − ρ²), which integrates to 1. Thus

fX(x) = (2π)^{−1/2} e^{−x²/2}.

Note marginals the same regardless of ρ!
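Sampling check (mine): for the unit-variance jointly Gaussian pair, the marginal of X is N(0, 1) no matter what the correlation coefficient ρ is.

import numpy as np

rng = np.random.default_rng(7)
for rho in (-0.9, 0.0, 0.5):
    cov = [[1.0, rho], [rho, 1.0]]
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
    print(f"rho={rho:+.1f}  mean(X)={xy[:, 0].mean():+.3f}  var(X)={xy[:, 0].var():.3f}")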

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 49

Consistency & directly given processes

Have seen two ways to describe (specify) a random variable — as a probability space + a function (random variable), or as a directly given rv (a distribution — pdf or pmf)

Same idea works for random vectors.

What about random processes? E.g., direct definition of the fair coin flipping process.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 50

For simplicity, consider a discrete time, discrete alphabet random process, say {Xn}. Given the random process, we can use the inverse image formula to compute the pmf for any finite collection of samples (Xk1, Xk2, . . . , XkK), e.g.,

pXk1,Xk2,...,XkK(x1, x2, . . . , xK) = Pr(Xki = xi; i = 1, . . . , K)
                                   = P({ω : Xki(ω) = xi; i = 1, . . . , K})

For example, in the fair coin flipping process

pXk1,Xk2,...,XkK(x1, x2, . . . , xK) = 2^{−K}, all (x1, x2, . . . , xK) ∈ {0, 1}^K

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 51

The axioms of probability ⇒ these pmfs for any choice of K and k1, . . . , kK must be consistent in the sense that if any of the pmfs is used to compute the probability of an event, the answer must be the same. E.g.,

pX1(x1) = ∑_{x2} pX1,X2(x1, x2)
        = ∑_{x0,x2} pX0,X1,X2(x0, x1, x2)
        = ∑_{x3,x5} pX1,X3,X5(x1, x3, x5)

since all of these computations yield the same probability in the original probability space: Pr(X1 = x1) = P({ω : X1(ω) = x1})

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 52


Bottom line: If given a discrete time discrete alphabet random process {Xn; n ∈ Z}, then for any finite K and collection of K sample times k1, . . . , kK we can find the joint pmf pXk1,Xk2,...,XkK(x1, x2, . . . , xK), and this collection of pmfs must be consistent.

Kolmogorov proved a converse to this idea, now called the Kolmogorov extension theorem, which provides the most common method for describing a random process:

Theorem. (Kolmogorov extension theorem for discrete time processes) Given a consistent family of finite-dimensional pmfs pXk1,Xk2,...,XkK(x1, x2, . . . , xK) for all dimensions K and sample times k1, . . . , kK, there is a random process {Xn; n ∈ Z} described by these marginals.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 53

To completely describe a random process, you need only provide a formula for a consistent family of pmfs for finite collections of samples.

The same result holds for continuous time random processes and for continuous alphabet processes (family of pdfs)

Difficult to prove, but the most common way to specify a model. Kolmogorov or directly-given representation of a random process — describe a consistent family of vector distributions. For completeness:

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 54

Theorem. (Kolmogorov extension theorem) Suppose that one is given a consistent family of finite-dimensional distributions PXt0,Xt1,...,Xtk−1 for all positive integers k and all possible sample times ti ∈ T; i = 0, 1, . . . , k − 1. Then there exists a random process {Xt; t ∈ T} that is consistent with this family. In other words, to describe a random process completely, it is sufficient to describe a consistent family of finite-dimensional distributions of its samples.

Example: Given a pmf p, define a family of vector pmfs by

pXk1,Xk2,...,XkK(x1, x2, . . . , xK) = ∏_{i=1}^{K} p(xi),

then there is a random process {Xn} having these vector pmfs for finite collections of samples. A process of this form is called an iid process.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 55

The continuous alphabet analogy is defined in terms of a pdf f — define the vector pdfs by

fXk1,Xk2,...,XkK(x1, x2, . . . , xK) = ∏_{i=1}^{K} f(xi)

A discrete time continuous alphabet process is iid if its joint pdfs factor in this way.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 56


Independent random variables

Return to definition of independent rvs, with more explanation.

The definition of independent random variables is an application of the definition of independent events.

Defined events F and G to be independent if P(F ∩ G) = P(F)P(G)

Two random variables X and Y defined on a probability space are independent if the events X⁻¹(F) and Y⁻¹(G) are independent for all F and G in B(R), i.e., if

P(X⁻¹(F) ∩ Y⁻¹(G)) = P(X⁻¹(F))P(Y⁻¹(G))

Equivalently, Pr(X ∈ F, Y ∈ G) = Pr(X ∈ F) Pr(Y ∈ G) or PXY(F × G) = PX(F)PY(G)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 57

If X, Y discrete, choosing F = {x}, G = {y} ⇒

pXY(x, y) = pX(x)pY(y) all x, y

Conversely, if the joint pmf = product of the marginals, then evaluate Pr(X ∈ F, Y ∈ G) as

P(X⁻¹(F) ∩ Y⁻¹(G)) = ∑_{x∈F, y∈G} pXY(x, y) = ∑_{x∈F, y∈G} pX(x) pY(y)
                   = (∑_{x∈F} pX(x)) (∑_{y∈G} pY(y)) = P(X⁻¹(F)) P(Y⁻¹(G))

⇒ independent by general definition

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 58

For general random variables, consider F = (−∞, x], G = (−∞, y]. Then if X, Y independent, FXY(x, y) = FX(x)FY(y) for all x, y. If pdfs exist, this implies that

fXY(x, y) = fX(x) fY(y)

Conversely, if this relation holds for all x, y, then P(X⁻¹(F) ∩ Y⁻¹(G)) = P(X⁻¹(F))P(Y⁻¹(G)) and hence X and Y are independent.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 59

A collection of rvs Xi, i = 0, 1, . . . , k − 1 is independent or mutually independent if all collections of events of the form Xi⁻¹(Fi); i = 0, 1, . . . , k − 1 are mutually independent for any Fi ∈ B(R); i = 0, 1, . . . , k − 1.

A collection of discrete random variables Xi; i = 0, 1, . . . , k − 1 is mutually independent iff

pX0,...,Xk−1(x0, . . . , xk−1) = ∏_{i=0}^{k−1} pXi(xi); ∀xi.

A collection of continuous random variables is independent iff the joint pdf factors as

fX0,...,Xk−1(x0, . . . , xk−1) = ∏_{i=0}^{k−1} fXi(xi).

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 60


A collection of general random variables is independent iff the joint cdf factors as

FX0,...,Xk−1(x0, . . . , xk−1) = ∏_{i=0}^{k−1} FXi(xi); (x0, x1, . . . , xk−1) ∈ Rᵏ.

The random vector is independent, identically distributed (iid) if the components are independent and the marginal distributions are all the same.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 61

Conditional distributions

Apply conditional probability to distributions.

Can express joint probabilities as products even if the rvs are not independent

E.g., distribution of input given observed output (for inference)

There are many types: conditional pmfs, conditional pdfs, conditional cdfs

Elementary and nonelementary conditional probability

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 62

Discrete conditional distributions

Simplest: direct application of elementary conditional probability to pmfs

Consider 2D discrete random vector (X,Y)

alphabet AX × AY

joint pmf pX,Y(x, y)

marginal pmfs pX and pY

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 63

Define for each x ∈ AX for which pX(x) > 0 the conditional pmf

pY|X(y|x) = P(Y = y | X = x)
          = P(Y = y, X = x) / P(X = x)
          = P({ω : Y(ω) = y} ∩ {ω : X(ω) = x}) / P({ω : X(ω) = x})
          = pX,Y(x, y) / pX(x),

the elementary conditional probability that Y = y given X = x

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 64


Properties of conditional pmfs

For fixed x, pY|X(·|x) is a pmf:

∑_{y∈AY} pY|X(y|x) = ∑_{y∈AY} pX,Y(x, y)/pX(x) = (1/pX(x)) ∑_{y∈AY} pX,Y(x, y) = (1/pX(x)) pX(x) = 1.

The joint pmf can be expressed as a product as

pX,Y(x, y) = pY|X(y|x) pX(x).

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 65

Can compute conditional probabilities by summing conditional pmfs,

P(Y ∈ F | X = x) = ∑_{y∈F} pY|X(y|x)

Can write probabilities of events of the form {X ∈ G, Y ∈ F} (rectangles) as

P(X ∈ G, Y ∈ F) = ∑_{x,y: x∈G, y∈F} pX,Y(x, y)
                = ∑_{x∈G} pX(x) ∑_{y∈F} pY|X(y | x)
                = ∑_{x∈G} pX(x) P(F | X = x)

Later: define nonelementary conditional probability to mimic this formula

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 66

If X and Y are independent, then pY |X(y|x) = pY(y)

Given pY|X, pX, Bayes rule for pmfs:

pX|Y(x|y) = pX,Y(x, y) / pY(y) = pY|X(y|x) pX(x) / ∑_u pY|X(y|u) pX(u),

a result often referred to as Bayes’ rule.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 67

Example of Bayes rule: Binary Symmetric Channel

Consider the following binary communication channel

[Diagram: bit X ∈ {0, 1} enters an adder ⊕ together with noise Z ∈ {0, 1}; the output is Y ∈ {0, 1}]

Bit sent is X ∼ Bern(p), 0 ≤ p ≤ 1, noise is Z ∼ Bern(ε), 0 ≤ ε ≤ 0.5, bit received is Y = (X + Z) mod 2 = X ⊕ Z, and X and Z are independent

Find 1) pX|Y(x|y), 2) pY(y), and 3) Pr{X ≠ Y}, the probability of error

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 68


1. To find pX|Y(x|y) use Bayes rule

pX|Y(x|y) = pY|X(y|x) pX(x) / ∑_{u∈AX} pY|X(y|u) pX(u)

Know pX(x), but we need to find pY|X(y|x):

pY|X(y|x) = Pr{Y = y | X = x} = Pr{X ⊕ Z = y | X = x}
          = Pr{x ⊕ Z = y | X = x} = Pr{Z = y ⊕ x | X = x}
          = Pr{Z = y ⊕ x}   since Z and X are independent
          = pZ(y ⊕ x)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 69

Therefore

pY|X(0 | 0) = pZ(0 ⊕ 0) = pZ(0) = 1 − ε
pY|X(0 | 1) = pZ(0 ⊕ 1) = pZ(1) = ε
pY|X(1 | 0) = pZ(1 ⊕ 0) = pZ(1) = ε
pY|X(1 | 1) = pZ(1 ⊕ 1) = pZ(0) = 1 − ε

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 70

Plugging into Bayes rule:

pX|Y(0|0) = pY|X(0|0) pX(0) / [pY|X(0|0) pX(0) + pY|X(0|1) pX(1)] = (1 − ε)(1 − p) / [(1 − ε)(1 − p) + εp]

pX|Y(1|0) = 1 − pX|Y(0|0) = εp / [(1 − ε)(1 − p) + εp]

pX|Y(0|1) = pY|X(1|0) pX(0) / [pY|X(1|0) pX(0) + pY|X(1|1) pX(1)] = ε(1 − p) / [ε(1 − p) + (1 − ε)p]

pX|Y(1|1) = 1 − pX|Y(0|1) = (1 − ε)p / [ε(1 − p) + (1 − ε)p]

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 71

2. We already found pY(y) as

pY(y) = pY|X(y|0) pX(0) + pY|X(y|1) pX(1)
      = (1 − ε)(1 − p) + εp   for y = 0
      = ε(1 − p) + (1 − ε)p   for y = 1

3. Now to find the probability of error Pr{X ≠ Y}, consider

Pr{X ≠ Y} = pX,Y(0, 1) + pX,Y(1, 0)
          = pY|X(1|0) pX(0) + pY|X(0|1) pX(1)
          = ε(1 − p) + εp = ε

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 72


An interesting special case is ε = 1/2. Here, Pr{X ≠ Y} = 1/2, which is the worst possible (no information is sent), and

pY(0) = (1/2)p + (1/2)(1 − p) = 1/2 = pY(1)

Therefore Y ∼ Bern(1/2), independent of the value of p!

In this case, the bit sent X and the bit received Y are independent (check this)
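A numerical version (mine, not from the notes) of the BSC calculation: the posterior pX|Y via Bayes' rule, the output pmf pY, and the error probability Pr{X ≠ Y} = ε.

eps, p = 0.1, 0.3                      # crossover probability and Pr{X = 1}
pX = {0: 1 - p, 1: p}
pZ = {0: 1 - eps, 1: eps}
pYgX = {(y, x): pZ[y ^ x] for x in (0, 1) for y in (0, 1)}   # pY|X(y|x) = pZ(y XOR x)

pY = {y: sum(pYgX[(y, x)] * pX[x] for x in (0, 1)) for y in (0, 1)}
pXgY = {(x, y): pYgX[(y, x)] * pX[x] / pY[y] for x in (0, 1) for y in (0, 1)}
perr = sum(pYgX[(1 - x, x)] * pX[x] for x in (0, 1))

print("pY:", pY)
print("pX|Y(0|0):", pXgY[(0, 0)], "  pX|Y(1|1):", pXgY[(1, 1)])
print("Pr{X != Y}:", perr)             # equals eps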

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 73

Conditional pmfs for vectors

Random vector (X0, X1, . . . , Xk−1)

pmf pX0,X1,...,Xk−1

Define conditional pmfs (assuming the denominators are > 0)

pXl|X0,...,Xl−1(xl|x0, . . . , xl−1) = pX0,...,Xl(x0, . . . , xl) / pX0,...,Xl−1(x0, . . . , xl−1).

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 74

⇒ chain rule

pX0,X1,...,Xn−1(x0, x1, . . . , xn−1)
  = [pX0,X1,...,Xn−1(x0, x1, . . . , xn−1) / pX0,X1,...,Xn−2(x0, x1, . . . , xn−2)] pX0,X1,...,Xn−2(x0, x1, . . . , xn−2)
  ...
  = pX0(x0) ∏_{i=1}^{n−1} [pX0,X1,...,Xi(x0, x1, . . . , xi) / pX0,X1,...,Xi−1(x0, x1, . . . , xi−1)]
  = pX0(x0) ∏_{l=1}^{n−1} pXl|X0,...,Xl−1(xl|x0, . . . , xl−1)

The formula plays an important role in characterizing memory in processes. Can be used to construct joint pmfs, and to specify a random process.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 75

Continuous conditional distributions

Continuous distributions more complicated

Given X, Y with joint pdf fX,Y and marginal pdfs fX, fY, define the conditional pdf

fY|X(y|x) ≡ fX,Y(x, y) / fX(x).

Analogous to a conditional pmf, but unlike a conditional pmf, not a conditional probability!

A density of conditional probability

Problem: the conditioning event has probability 0. Elementary conditional probability does not work.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 76


Conditional pdf is a pdf:

∫ fY|X(y|x) dy = ∫ [fX,Y(x, y)/fX(x)] dy = (1/fX(x)) ∫ fX,Y(x, y) dy = (1/fX(x)) fX(x) = 1,

provided we require that fX(x) > 0 over the region of integration.

Given a conditional pdf fY|X, define the (nonelementary) conditional probability that Y ∈ F given X = x by

P(Y ∈ F | X = x) ≡ ∫_F fY|X(y|x) dy.

Resembles discrete form.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 77

Nonelementary conditional probability

Does P(Y ∈ F | X = x) = ∫_F fY|X(y|x) dy make sense as an appropriate definition of conditional probability given an event of zero probability?

Observe that, analogous to the corresponding result for pmfs, assuming the pdfs all make sense,

P(X ∈ G, Y ∈ F) = ∫∫_{x,y: x∈G, y∈F} fX,Y(x, y) dx dy
                = ∫_{x∈G} fX(x) (∫_{y∈F} fY|X(y | x) dy) dx
                = ∫_{x∈G} fX(x) P(F | X = x) dx

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 78

Our definition is ad hoc. But the careful mathematical definition of conditional probability P(F | X = x) for an event of 0 probability is made not by a formula such as we have used to define conditional pmfs and pdfs and elementary conditional probability, but by its behavior inside an integral (like the Dirac delta). In particular, P(F | X = x) is defined as any measurable function satisfying the above equation for all events F and G, which our definition does.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 79

Bayes rule for pdfs

Bayes rule:

fX|Y(x|y) = fX,Y(x, y)/fY(y) = fY|X(y|x) fX(x) / ∫ fY|X(y|u) fX(u) du.

Example of conditional pdfs: 2D Gaussian

U = (X, Y), Gaussian pdf with mean (mX, mY)ᵗ and covariance matrix

Λ = [ σX²  ρσXσY ; ρσXσY  σY² ],

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 80


Algebra ⇒

det(Λ) = σX² σY² (1 − ρ²)

Λ⁻¹ = (1/(1 − ρ²)) [ 1/σX²  −ρ/(σXσY) ; −ρ/(σXσY)  1/σY² ]

so

fXY(x, y) = (1/(2π√(det Λ))) e^{−(1/2)(x−mX, y−mY) Λ⁻¹ (x−mX, y−mY)ᵗ}
          = (1/(2πσXσY√(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) × [ ((x − mX)/σX)² − 2ρ (x − mX)(y − mY)/(σXσY) + ((y − mY)/σY)² ] }

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 81

Rearrange

fXY(x, y) = [ exp{−(1/2)((x − mX)/σX)²} / √(2πσX²) ] × [ exp{−(1/2)((y − mY − (ρσY/σX)(x − mX)) / (√(1 − ρ²) σY))²} / √(2πσY²(1 − ρ²)) ]

fY|X(y|x) = exp{−(1/2)((y − mY − (ρσY/σX)(x − mX)) / (√(1 − ρ²) σY))²} / √(2πσY²(1 − ρ²)),

Gaussian with variance σ²_{Y|X} ≡ σY²(1 − ρ²) and mean m_{Y|X} ≡ mY + ρ(σY/σX)(x − mX)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 82

Integrate the joint over y (as before) ⇒

fX(x) = e^{−(x−mX)²/(2σX²)} / √(2πσX²).

Similarly, fY(y) and fX|Y(x|y) are also Gaussian

Note: X and Y jointly Gaussian ⇒ also both individually and conditionally Gaussian!

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 83

Chain rule for pdfs

Assume fX0,X1,...,Xi(x0, x1, . . . , xi) > 0,

fX0,X1,...,Xn−1(x0, x1, . . . , xn−1)
  = [fX0,X1,...,Xn−1(x0, x1, . . . , xn−1) / fX0,X1,...,Xn−2(x0, x1, . . . , xn−2)] fX0,X1,...,Xn−2(x0, x1, . . . , xn−2)
  ...
  = fX0(x0) ∏_{i=1}^{n−1} [fX0,X1,...,Xi(x0, x1, . . . , xi) / fX0,X1,...,Xi−1(x0, x1, . . . , xi−1)]
  = fX0(x0) ∏_{i=1}^{n−1} fXi|X0,...,Xi−1(xi|x0, . . . , xi−1).

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 84


Statistical detection and classification

Simple application of conditional probability mass functions describing discrete random vectors

Transmitted: discrete rv X, pmf pX, pX(1) = p

(e.g., one sample of a binary random process)

Received: rv Y

Conditional pmf (noisy channel) pY |X(y|x)

More specific example as a special case: X Bernoulli, parameter p

pY|X(y|x) = ε if x ≠ y, 1 − ε if x = y.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 85

binary symmetric channel (BSC)

Given observation Y, what is the best guess X̂(Y) of the transmitted value?

decision rule or detection rule

Measure quality by the probability the guess is correct:

Pc(X̂) = Pr(X = X̂(Y)) = 1 − Pe,

where Pe(X̂) = Pr(X̂(Y) ≠ X).

A decision rule is optimal if it yields the smallest possible Pe or maximum possible Pc

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 86

Pr(X̂ = X) = 1 − Pe(X̂) = ∑_{(x,y): X̂(y)=x} pX,Y(x, y)
           = ∑_{(x,y): X̂(y)=x} pX|Y(x|y) pY(y)
           = ∑_y pY(y) ∑_{x: X̂(y)=x} pX|Y(x|y)
           = ∑_y pY(y) pX|Y(X̂(y)|y).

To maximize the sum, maximize pX|Y(X̂(y)|y) for each y.

Accomplished by X̂(y) ≡ arg max_u pX|Y(u|y), which yields

pX|Y(X̂(y)|y) = max_u pX|Y(u|y)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 87

This is maximum a posteriori (MAP) detection rule

In the binary example: Choose X̂(y) = y if ε < 1/2 and X̂(y) = 1 − y if ε > 1/2.

⇒ the minimum (optimal) error probability over all possible rules is min(ε, 1 − ε)

In the general nonbinary case, statistical detection is statistical classification: Unseen X might be presence or absence of a disease, observation Y the results of various tests.

General Bayesian classification allows weighting of the cost of different kinds of errors (Bayes risk), so minimize a weighted average (expected cost) instead of only the probability of error
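A minimal sketch (mine) of the MAP rule x̂(y) = argmax_x pX|Y(x|y) for a discrete channel specified by a prior pX and a conditional pmf pY|X; the example reproduces the binary result Pe = min(ε, 1 − ε).

import numpy as np

def map_detector(pX, pYgX):
    """pX: shape (nx,); pYgX: shape (ny, nx) with pYgX[y, x] = pY|X(y|x).
    Returns xhat[y] and the resulting probability of error."""
    joint = pYgX * pX[None, :]          # pX,Y(x, y) arranged as [y, x]
    xhat = joint.argmax(axis=1)         # maximizing pX|Y is maximizing the joint
    pc = joint[np.arange(joint.shape[0]), xhat].sum()
    return xhat, 1.0 - pc

eps = 0.1
pX = np.array([0.5, 0.5])
pYgX = np.array([[1 - eps, eps],
                 [eps, 1 - eps]])
print(map_detector(pX, pYgX))           # xhat = [0, 1], Pe = eps = min(eps, 1-eps)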

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 88


Additive noise: Discrete random variables

Common setup in communications, signal processing, statistics:

Original signal X has random noise W (independent of X) added to it; observe Y = X + W

Typically use observation Y to make inference about X

Begin by deriving conditional distributions.

Discrete case: Have independent rvs X and W with pmfs pX and pW. Form Y = X + W. Find pY

Use inverse image formula:

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 89

pX,Y(x, y) = Pr(X = x, Y = y) = Pr(X = x, X + W = y)
           = ∑_{α,β: α=x, α+β=y} pX,W(α, β) = pX,W(x, y − x)
           = pX(x) pW(y − x).

Note: Formula only makes sense if y − x is in the range space of W

Thus

pY|X(y|x) = pX,Y(x, y) / pX(x) = pW(y − x),

Intuitive!

Marginal for Y:

pY(y) = ∑_x pX,Y(x, y) = ∑_x pX(x) pW(y − x)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 90

a discrete convolution

Above uses ordinary real arithmetic. Similar results hold for other definitions of addition, e.g., modulo 2 arithmetic for binary

As with linear systems, convolutions can usually be easily evaluated in the transform domain. Will do shortly.
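A small sketch (mine): for Y = X + W with independent nonnegative integer-valued rvs, the pmf of Y is the discrete convolution of pX and pW.

import numpy as np

pX = np.array([0.2, 0.5, 0.3])          # pX(0), pX(1), pX(2)
pW = np.array([0.6, 0.4])               # pW(0), pW(1)
pY = np.convolve(pX, pW)                # pY(y) = sum_x pX(x) pW(y - x)
print(pY, pY.sum())                     # a valid pmf on {0, 1, 2, 3}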

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 91

Additive noise: continuous random variables

X, W, fXW(x,w) = fX(x) fW(w) (independent), Y = X +W

Find fY |X and fY

Since continuous, find joint pdf by first finding joint cdf

FX,Y(x, y) = Pr(X ≤ x, Y ≤ y) = Pr(X ≤ x, X + W ≤ y)
           = ∫∫_{α,β: α≤x, α+β≤y} fX,W(α, β) dα dβ
           = ∫_{−∞}^{x} dα ∫_{−∞}^{y−α} dβ fX(α) fW(β)
           = ∫_{−∞}^{x} dα fX(α) FW(y − α).

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 92


Taking derivatives:

fX,Y(x, y) = fX(x) fW(y − x)

⇒ fY|X(y|x) = fW(y − x).

⇒ fY(y) = ∫ fX,Y(x, y) dx = ∫ fX(x) fW(y − x) dx,

a convolution integral of the pdfs fX and fW

The pdf fX|Y follows from Bayes’ rule:

fX|Y(x|y) = fX(x) fW(y − x) / ∫ fX(α) fW(y − α) dα.

Gaussian example:

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 93

Additive Gaussian noise

Assume fX = N(0, σX²), fW = N(0, σW²), fXW(x, w) = fX(x) fW(w), Y = X + W.

fY|X(y|x) = fW(y − x) = e^{−(y−x)²/(2σW²)} / √(2πσW²)

which is N(x, σW²).

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 94

To find fX|Y using Bayes’ rule, need fY:

fY(y) = ∫_{−∞}^{∞} fY|X(y|α) fX(α) dα
      = ∫_{−∞}^{∞} [ exp{−(y − α)²/(2σW²)} / √(2πσW²) ] [ exp{−α²/(2σX²)} / √(2πσX²) ] dα
      = (1/(2πσXσW)) ∫_{−∞}^{∞} exp{ −(1/2)[ (y² − 2αy + α²)/σW² + α²/σX² ] } dα
      = [ exp{−y²/(2σW²)} / (2πσXσW) ] ∫_{−∞}^{∞} exp{ −(1/2)[ α²(1/σX² + 1/σW²) − 2αy/σW² ] } dα

Can integrate by completing the square (later we will see an easier way using transforms, but this trick is not difficult)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 95

Integrand resembles

exp{−(1/2)((α − m)/σ)²},

which has integral

∫_{−∞}^{∞} exp{−(1/2)((α − m)/σ)²} dα = √(2πσ²)

(Gaussian pdf integrates to 1)

Compare

−(1/2)[ α²(1/σX² + 1/σW²) − 2αy/σW² ]   vs.   −(1/2)((α − m)/σ)² = −(1/2)[ α²/σ² − 2αm/σ² + m²/σ² ].

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 96


The braced terms will be the same if we choose

1/σ² = 1/σW² + 1/σX²  ⇒  σ² = σX²σW² / (σX² + σW²),

and

y/σW² = m/σ²  ⇒  m = (σ²/σW²) y.

Then

α²(1/σX² + 1/σW²) − 2αy/σW² = ((α − m)/σ)² − m²/σ²

— “completing the square.”

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 97

∫_{−∞}^{∞} exp{ −(1/2)[ α²(1/σX² + 1/σW²) − 2αy/σW² ] } dα
  = ∫_{−∞}^{∞} exp{ −(1/2)[ ((α − m)/σ)² − m²/σ² ] } dα = √(2πσ²) exp{ m²/(2σ²) }

fY(y) = [ exp{−(1/2) y²/σW²} / (2πσXσW) ] √(2πσ²) exp{ m²/(2σ²) }
      = exp{ −(1/2) y²/(σX² + σW²) } / √(2π(σX² + σW²))

So fY = N(0, σX² + σW²)

The sum of two independent 0 mean Gaussian rvs is another 0 mean Gaussian rv; the variance of the sum = sum of the variances

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 98

For the a posteriori probability fX|Y use Bayes’ rule + algebra

fX|Y(x|y) = fY|X(y|x) fX(x) / fY(y)
  = [ exp{−(y − x)²/(2σW²)} / √(2πσW²) ] [ exp{−x²/(2σX²)} / √(2πσX²) ] / [ exp{−(1/2) y²/(σX² + σW²)} / √(2π(σX² + σW²)) ]
  = exp{ −(1/2)[ (y² − 2yx + x²)/σW² + x²/σX² − y²/(σX² + σW²) ] } / √(2πσX²σW²/(σX² + σW²))
  = exp{ −(x − yσX²/(σX² + σW²))² / (2σX²σW²/(σX² + σW²)) } / √(2πσX²σW²/(σX² + σW²)).

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 99

fX|Y(x|y) = N( (σX²/(σX² + σW²)) y,  σX²σW²/(σX² + σW²) ).

The mean of a conditional distribution is called a conditional mean; the variance of a conditional distribution is called a conditional variance

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 100


Continuous additive noise with discrete input

Most important case of mixed distributions in communications applications

Typical: Binary random variable X, Gaussian random variable W, X and W independent, Y = X + W

Previous examples do not work: one rv is discrete, the other continuous

Similar signal processing issue: Observe Y, guess X

As before, may be one sample of a random process; in practice have Xn, Wn, Yn. At time n, observe Yn, guess Xn

The conditional cdf FY|X(y|x) for Y given X = x is an elementary conditional probability. Analogous to the purely discrete and purely continuous cases

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 101

FY |X(y|x) = Pr(Y ≤ y | X = x) = Pr(X +W ≤ y | X = x)

= Pr(x +W ≤ y | X = x) = Pr(W ≤ y − x | X = x)

= Pr(W ≤ y − x) = FW(y − x)

Differentiating,

fY|X(y|x) = (d/dy) FY|X(y|x) = (d/dy) FW(y − x) = fW(y − x)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 102

The joint distribution is described by a combination of pmf and pdf:

Pr(X ∈ F and Y ∈ G) = ∑_{x∈F} pX(x) ∫_G fY|X(y|x) dy = ∑_{x∈F} pX(x) ∫_G fW(y − x) dy.

Choosing F = R yields

Pr(Y ∈ G) = ∑_x pX(x) ∫_G fY|X(y|x) dy = ∑_x pX(x) ∫_G fW(y − x) dy.

Choosing G = (−∞, y] yields the cdf FY(y) ⇒

fY(y) = ∑_x pX(x) fY|X(y|x) = ∑_x pX(x) fW(y − x),

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 103

a convolution, analogous to pure discrete and pure continuous cases

Continuing the analogy, Bayes’ rule suggests a conditional pmf:

pX|Y(x|y) = fY|X(y|x) pX(x) / fY(y) = fY|X(y|x) pX(x) / ∑_α pX(α) fY|X(y|α),

but this is not an elementary conditional probability — the conditioning event has probability 0!

Can be justified in similar way to conditional pdfs:

Pr(X ∈ F and Y ∈ G) = ∫_G dy fY(y) Pr(X ∈ F | Y = y)
                     = ∫_G dy fY(y) ∑_{x∈F} pX|Y(x|y)

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 104


so that pX|Y(x|y) satisfies

Pr(X ∈ F | Y = y) = ∑_{x∈F} pX|Y(x|y)

Apply to binary input and Gaussian noise: the conditional pmf of the binary input given the noisy observation is

pX|Y(x|y) = fW(y − x) pX(x) / fY(y) = fW(y − x) pX(x) / ∑_α pX(α) fW(y − α);  y ∈ R, x ∈ {0, 1}.

Can now solve classical binary detection in Gaussian noise.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 105

Binary detection in Gaussian noise

The derivation of the MAP detector or classifier extends immediately to a binary input random variable and independent Gaussian noise

As in the purely discrete case, the MAP detector X̂(y) of X given Y = y is given by

X̂(y) = arg max_x pX|Y(x|y) = arg max_x [ fW(y − x) pX(x) / ∑_α pX(α) fW(y − α) ].

The denominator of the conditional pmf does not depend on x, so the denominator has no effect on the maximization:

X̂(y) = arg max_x pX|Y(x|y) = arg max_x fW(y − x) pX(x).

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 106

Assume for simplicity that X is equally likely to be 0 or 1:

X̂(y) = arg max_x pX|Y(x|y) = arg max_x (1/√(2πσW²)) exp{ −(1/2)(x − y)²/σW² }
     = arg min_x |x − y|

Minimum distance or nearest neighbor decision: choose the closest x to y

X̂(y) = 0 if y < 0.5, 1 if y > 0.5.

A threshold detector

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 107

Error probability of the optimal detector:

Pe = Pr(X̂(Y) ≠ X)
   = Pr(X̂(Y) ≠ 0 | X = 0) pX(0) + Pr(X̂(Y) ≠ 1 | X = 1) pX(1)
   = Pr(Y > 0.5 | X = 0) pX(0) + Pr(Y < 0.5 | X = 1) pX(1)
   = Pr(W + X > 0.5 | X = 0) pX(0) + Pr(W + X < 0.5 | X = 1) pX(1)
   = Pr(W > 0.5 | X = 0) pX(0) + Pr(W + 1 < 0.5 | X = 1) pX(1)
   = Pr(W > 0.5) pX(0) + Pr(W < −0.5) pX(1)

using the independence of W and X. In terms of the Φ function:

Pe = (1/2)[ 1 − Φ(0.5/σW) + Φ(−0.5/σW) ] = Φ(−1/(2σW)).

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 108


Statistical estimation

In detection/classification problems, the goal is to guess which of a discrete set of possibilities is true. The MAP rule is an intuitive solution.

Different if (X, Y) continuous, we observe Y, and guess X.

Examples: X, W independent Gaussian, Y = X + W. What is the best guess of X given Y?

{Xn} is a continuous alphabet random process (perhaps Gaussian). Observe Xn−1. What is the best guess for Xn? What if we observe X0, X1, X2, . . . , Xn−1?

The quality criterion for the discrete case no longer works: Pr(X̂(Y) = X) = 0 in general.

EE278: Introduction to Statistical Signal Processing, winter 2010–2011 cR.M. Gray 2011 109

Will later introduce another quality measure (MSE) and optimize.

Now mention other approaches.

Examples of estimation or regression instead of detection


MAP Estimation

Mimic MAP detection, maximize the conditional probability function

\hat{X}_{\mathrm{MAP}}(y) = \arg\max_x f_{X|Y}(x|y)

Easy to describe, an application of conditional pdfs + Bayes.

But cannot argue “optimal” in the sense of maximizing quality

Example: Gaussian signal plus noise

Found f_{X|Y}(x|y) = Gaussian with mean y\sigma_X^2/(\sigma_X^2 + \sigma_W^2)

Gaussian pdf is maximized at its mean ⇒ MAP estimate of X given Y = y is the conditional mean y\sigma_X^2/(\sigma_X^2 + \sigma_W^2).


Maximum Likelihood Estimation

The maximum likelihood (ML) estimate of X given Y = y is the value of x that maximizes the conditional pdf f_{Y|X}(y|x) (instead of the a posteriori pdf f_{X|Y}(x|y))

\hat{X}_{\mathrm{ML}}(y) = \arg\max_x f_{Y|X}(y|x).

Advantage: Do not need to know the prior f_X and use Bayes to find f_{X|Y}(x|y). Simple

In the Gaussian case, \hat{X}_{\mathrm{ML}}(y) = y.
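A small sketch contrasting the two estimates for the Gaussian signal-plus-noise example; the function names and parameter values are illustrative assumptions:

```python
def map_estimate(y, sigma_x2, sigma_w2):
    """MAP estimate for Y = X + W, X ~ N(0, sigma_x2), W ~ N(0, sigma_w2):
    the conditional mean y * sigma_x2 / (sigma_x2 + sigma_w2)."""
    return y * sigma_x2 / (sigma_x2 + sigma_w2)

def ml_estimate(y):
    """ML estimate ignores the prior on X and returns y itself."""
    return y

y = 1.2
print(map_estimate(y, sigma_x2=1.0, sigma_w2=0.25))   # 0.96, shrunk toward 0
print(ml_estimate(y))                                  # 1.2
```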

Will return to estimation when we consider expectations in more detail.


Characteristic functions

When summing independent random variables, find the derived distribution by convolution of pmfs or pdfs

Can be complicated, avoidable using transforms as in linear systems

Summing independent random variables arises frequently in signal analysis problems. E.g., an iid random process X_k is put into a linear filter to produce an output Y_n = \sum_{k=1}^{n} h_{n-k} X_k.

What is distribution of Yn?

n-fold convolution a mess. Describe shortcut.

Transforms of probability functions are called characteristic functions. Variation on Fourier/Laplace transforms. Notation varies.


For discrete rv with pmf pX, define characteristic function MX

M_X(ju) = \sum_x p_X(x) e^{jux}

where u is usually assumed to be real.

A discrete exponential transform. Sometimes written φ or Φ, or with the j not included (∼ notational differences in Fourier transforms)

Alternative useful form: Recall the definition of the expectation of a random variable g defined on a discrete probability space described by a pmf p: E(g) = \sum_\omega p(\omega) g(\omega)

Consider the probability space (Ω_X, B(Ω_X), P_X) with P_X described by pmf p_X

This is the directly-given representation for the rv X; X is the identity function on Ω_X: X(x) = x


Define the random variable g(X) on this space by g(X)(x) = e^{jux}. Then E[g(X)] = \sum_x p_X(x) e^{jux} so that

M_X(ju) = E[e^{juX}]
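A small numerical illustration of the characteristic function as an expectation; the helper char_fn and the fair-die pmf are illustrative assumptions, not from the notes:

```python
import numpy as np

def char_fn(values, pmf, u):
    """M_X(ju) = E[exp(j*u*X)] for a discrete rv, evaluated at real u (array)."""
    values, pmf, u = map(np.asarray, (values, pmf, u))
    return np.sum(pmf * np.exp(1j * np.outer(u, values)), axis=1)

vals = np.arange(1, 7)          # toy example: fair six-sided die
pmf = np.full(6, 1 / 6)
print(char_fn(vals, pmf, [0.0, 0.5, 1.0]))   # M_X(j0) = 1, etc.
```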

Characteristic functions, like probabilities, can be viewed as special cases of expectation

Resembles discrete time Fourier transform

F_\nu(p_X) = \sum_x p_X(x) e^{-j2\pi\nu x}

and the z-transform

Z_z(p_X) = \sum_x p_X(x) z^x.


M_X(ju) = F_{-u/2\pi}(p_X) = Z_{e^{ju}}(p_X)

Properties of characteristic functions follow from those of Fourier/Laplace/z/exponential transforms.


Can recover the pmf from M_X by suitable inversion. E.g., given {p_X(k); k ∈ Z_N},

\frac{1}{2\pi} \int_{-\pi}^{\pi} M_X(ju) e^{-juk}\, du = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[\sum_x p_X(x) e^{jux}\right] e^{-juk}\, du
  = \sum_x p_X(x)\, \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{ju(x-k)}\, du
  = \sum_x p_X(x)\, \delta_{k-x} = p_X(k).

But usually invert by inspection or from tables, avoiding inverse transforms


Characteristic functions and summing independent rvs

Two independent random variables X, W with pmfs p_X and p_W and characteristic functions M_X and M_W

Y = X +W

To find characteristic function of Y

M_Y(ju) = \sum_y p_Y(y) e^{juy}

use the inverse image formula

p_Y(y) = \sum_{x,w:\ x+w=y} p_{X,W}(x, w)


to obtain

M_Y(ju) = \sum_y \left[\sum_{x,w:\ x+w=y} p_{X,W}(x, w)\right] e^{juy} = \sum_y \sum_{x,w:\ x+w=y} p_{X,W}(x, w) e^{juy}
        = \sum_y \sum_{x,w:\ x+w=y} p_{X,W}(x, w) e^{ju(x+w)}
        = \sum_{x,w} p_{X,W}(x, w) e^{ju(x+w)}

Last sum factors:

M_Y(ju) = \sum_{x,w} p_X(x) p_W(w) e^{jux} e^{juw} = \left[\sum_x p_X(x) e^{jux}\right]\left[\sum_w p_W(w) e^{juw}\right] = M_X(ju) M_W(ju),

⇒ the transform of the pmf of the sum of independent random variables is the product of their transforms


Iterate:

Theorem 1. If X_i; i = 1, \ldots, N are independent random variables with characteristic functions M_{X_i}, then the characteristic function of the random variable Y = \sum_{i=1}^{N} X_i is

M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju).

If the X_i are independent and identically distributed with common characteristic function M_X, then

M_Y(ju) = M_X^N(ju).


Example: X Bernoulli with parameter p = pX(1) = 1 − pX(0)

M_X(ju) = \sum_{k=0}^{1} e^{juk} p_X(k) = (1 − p) + p e^{ju}

X_i; i = 1, \ldots, n iid Bernoulli random variables, Y_n = \sum_{k=1}^{n} X_k, then

M_{Y_n}(ju) = [(1 − p) + p e^{ju}]^n

with the binomial theorem ⇒

M_{Y_n}(ju) = \sum_{k=0}^{n} p_{Y_n}(k) e^{juk} = ((1 − p) + p e^{ju})^n = \sum_{k=0}^{n} \underbrace{\binom{n}{k} (1 − p)^{n-k} p^k}_{p_{Y_n}(k)} e^{juk},


Uniqueness of transforms ⇒

p_{Y_n}(k) = \binom{n}{k} (1 − p)^{n-k} p^k; \quad k ∈ Z_{n+1}.
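A numerical check that the n-fold product of Bernoulli characteristic functions matches the transform of the binomial pmf; p, n, and the u grid below are assumed toy values:

```python
import numpy as np
from scipy.stats import binom

p, n = 0.3, 8
u = np.linspace(-3, 3, 7)
lhs = ((1 - p) + p * np.exp(1j * u)) ** n          # M_X(ju)^n
k = np.arange(n + 1)
rhs = np.sum(binom.pmf(k, n, p) * np.exp(1j * np.outer(u, k)), axis=1)
print(np.allclose(lhs, rhs))                        # True
```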

Same idea works for continuous rvs

For a continuous random variable X with pdf f_X, define the characteristic function M_X of the random variable (or of the pdf) as

M_X(ju) = \int f_X(x) e^{jux}\, dx.

As in the discrete case,

M_X(ju) = E[e^{juX}].


Relates to the continuous-time Fourier transform

F_\nu(f_X) = \int f_X(x) e^{-j2\pi\nu x}\, dx

and the Laplace transform

L_s(f_X) = \int f_X(x) e^{-sx}\, dx

by

M_X(ju) = F_{-u/2\pi}(f_X) = L_{-ju}(f_X)

Hence can apply results from Fourier/Laplace transform theory. E.g., given a well-behaved density {f_X(x); x ∈ R} ↔ M_X(ju), can invert the transform:

f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} M_X(ju) e^{-jux}\, du.


Consider again two independent random variables X and W with pdfs f_X and f_W and characteristic functions M_X and M_W, and let Y = X + W

Paralleling the discrete case,

M_Y(ju) = M_X(ju) M_W(ju).

Will later see simple and general proof.


As in the discrete case, iterating gives the result for many independent rvs:

If X_i; i = 1, \ldots, N are independent random variables with characteristic functions M_{X_i}, then the characteristic function of the random variable Y = \sum_{i=1}^{N} X_i is

M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju).

If the X_i are independent and identically distributed with common characteristic function M_X, then

M_Y(ju) = M_X^N(ju).


Summing Independent Gaussian rvs

X ∼ N(m,σ2)

Characteristic function found by completing the square:

M_X(ju) = E(e^{juX}) = \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}} e^{-(x-m)^2/2\sigma^2} e^{jux}\, dx
        = \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}} e^{-(x^2 - 2mx - 2\sigma^2 jux + m^2)/2\sigma^2}\, dx
        = \left[\int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}} e^{-(x - (m + ju\sigma^2))^2/2\sigma^2}\, dx\right] e^{jum - u^2\sigma^2/2}
        = e^{jum - u^2\sigma^2/2},

since the bracketed Gaussian-form integral equals 1.

Thus N(m, \sigma^2) ↔ e^{jum - u^2\sigma^2/2}


Xi; i = 1, . . . , n iid Gaussian random variables with pdfs N(m,σ2)

Y_n = \sum_{k=1}^{n} X_k

Then

M_{Y_n}(ju) = [e^{jum - u^2\sigma^2/2}]^n = e^{ju(nm) - u^2(n\sigma^2)/2},

= characteristic function of N(nm, n\sigma^2)

Moral: Use characteristic functions to derive distributions of sums of independent rvs.
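A Monte Carlo sketch of this fact; the values of m, σ, and n are assumptions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma, n = 0.5, 2.0, 10
samples = rng.normal(m, sigma, size=(100_000, n)).sum(axis=1)   # sums of n iid N(m, sigma^2)
print(samples.mean(), "should be close to", n * m)
print(samples.var(), "should be close to", n * sigma**2)
```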


Gaussian random vectors

A random vector is Gaussian if its density is Gaussian

Component rvs are jointly Gaussian

Description is complicated, but many nice properties

Multidimensional characteristic functions help derivation

Random vector X = (X0, . . . , Xn−1)

vector argument u = (u0, . . . , un−1)


n-dimensional characteristic function:

M_X(ju) = M_{X_0,\ldots,X_{n-1}}(ju_0, \ldots, ju_{n-1}) = E[e^{ju^t X}] = E\left[\exp\left(j \sum_{k=0}^{n-1} u_k X_k\right)\right]

Can be shown using multivariable calculus: a Gaussian random vector with mean vector m and covariance matrix Λ has characteristic function

M_X(ju) = e^{ju^t m - u^t \Lambda u/2} = \exp\left(j \sum_{k=0}^{n-1} u_k m_k - \frac{1}{2} \sum_{k=0}^{n-1} \sum_{m=0}^{n-1} u_k \Lambda(k, m) u_m\right)

Same basic form as the Gaussian pdf, but depends directly on Λ, not Λ^{-1}
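A quick check of the closed form against a Monte Carlo estimate of E[exp(ju^t X)]; the mean vector, covariance matrix, and u below are assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(2)
m = np.array([1.0, -0.5])
Lam = np.array([[1.0, 0.3],
                [0.3, 0.5]])
u = np.array([0.7, -1.2])

closed_form = np.exp(1j * u @ m - 0.5 * u @ Lam @ u)
X = rng.multivariate_normal(m, Lam, size=200_000)
monte_carlo = np.mean(np.exp(1j * X @ u))
print(closed_form, monte_carlo)    # the two should be close
```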


So exists more generally, only need Λ to be nonnegative definite (instead of strictly positive definite). Define a Gaussian rv more generally as a rv having a characteristic function of this form (inverse transform will have singularities)


Further examples of random processes:

Have seen two ways to define rps: indirectly in terms of an underlying probability space or directly (Kolmogorov representation) by describing a consistent family of joint distributions (via pmfs, pdfs, or cdfs).

Used to define discrete time iid processes and processes which can be constructed from iid processes by coding or filtering.

Introduce more classes of processes and develop some properties for various examples.

In particular: Gaussian random processes and Markov processes


Gaussian random processes

A random process {X_t; t ∈ T} is Gaussian if the random vectors (X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}}) are Gaussian for all positive integers k and all possible sample times t_i ∈ T; i = 0, 1, \ldots, k − 1.

Works for continuous and discrete time.

Consistent family?

Yes, if all mean vectors and covariance matrices are drawn from a common mean function {m(t); t ∈ T} and covariance function {Λ(t, s); t, s ∈ T}; i.e., for any choice of sample times t_0, \ldots, t_{k-1} ∈ T the random vector (X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}}) is Gaussian with mean (m(t_0), m(t_1), \ldots, m(t_{k-1})) and covariance matrix Λ = {Λ(t_l, t_j); l, j ∈ Z_k}.


Gaussian random processes in both discrete and continuous time are extremely common in the analysis of random systems and have many nice properties.


Discrete time Markov processes

An iid process is memoryless because the present is independent of the past.

A Markov process allows dependence on the past in a structured way.

Introduce via example.


A binary Markov process

Xn; n = 0, 1, . . . is a Bernoulli process with

p_{X_n}(x) = \begin{cases} p & x = 1 \\ 1 − p & x = 0 \end{cases},

p ∈ (0, 1) a fixed parameter

Since the pmf p_{X_n}(x) does not depend on n, abbreviate to p_X:

p_X(x) = p^x (1 − p)^{1-x}; \quad x = 0, 1.


Since the process is iid,

p_{X^n}(x^n) = \prod_{i=0}^{n-1} p_X(x_i) = p^{w(x^n)} (1 − p)^{n - w(x^n)},

where w(x^n) = Hamming weight of the binary vector x^n.

Let X_n be the input to a device which produces an output binary process Y_n defined by

Y_n = \begin{cases} Y_0 & n = 0 \\ X_n \oplus Y_{n-1} & n = 1, 2, \ldots \end{cases},

where Y_0 is a binary equiprobable random variable (p_{Y_0}(0) = p_{Y_0}(1) = 0.5), independent of all of the X_n, and ⊕ is mod 2 addition

(linear filter using mod 2 arithmetic)


Alternatively:

Y_n = \begin{cases} 1 & \text{if } X_n \neq Y_{n-1} \\ 0 & \text{if } X_n = Y_{n-1}. \end{cases}

This process is called a binary autoregressive process. As will be seen, it is also called the symmetric binary Markov process

Unlike X_n, Y_n depends strongly on past values. If p < 1/2, Y_n is more likely to equal Y_{n-1} than not

If p is small, Yn is likely to have long runs of 0s and 1s.
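A short simulation sketch of this process; the function name binary_markov and the choice p = 0.05 are illustrative assumptions:

```python
import numpy as np

def binary_markov(n, p, rng=None):
    """Simulate the binary autoregressive process Y_n = X_n XOR Y_{n-1},
    with X_n iid Bernoulli(p) and Y_0 equiprobable, returning Y_0..Y_n."""
    rng = rng or np.random.default_rng()
    x = rng.random(n) < p                  # X_1, ..., X_n
    y = np.empty(n + 1, dtype=int)
    y[0] = rng.integers(0, 2)              # equiprobable initial value Y_0
    for k in range(1, n + 1):
        y[k] = y[k - 1] ^ int(x[k - 1])    # mod-2 addition
    return y

y = binary_markov(60, p=0.05, rng=np.random.default_rng(3))
print("".join(map(str, y)))                # small p -> long runs of 0s and 1s
```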

Task: Find the joint pmfs for the new process: p_{Y^n}(y^n) = \Pr(Y^n = y^n)


Use inverse image formula:

p_{Y^n}(y^n) = \Pr(Y^n = y^n)
  = \Pr(Y_0 = y_0, Y_1 = y_1, Y_2 = y_2, \ldots, Y_{n-1} = y_{n-1})
  = \Pr(Y_0 = y_0, X_1 \oplus Y_0 = y_1, X_2 \oplus Y_1 = y_2, \ldots, X_{n-1} \oplus Y_{n-2} = y_{n-1})
  = \Pr(Y_0 = y_0, X_1 \oplus y_0 = y_1, X_2 \oplus y_1 = y_2, \ldots, X_{n-1} \oplus y_{n-2} = y_{n-1})
  = \Pr(Y_0 = y_0, X_1 = y_1 \oplus y_0, X_2 = y_2 \oplus y_1, \ldots, X_{n-1} = y_{n-1} \oplus y_{n-2})
  = p_{Y_0, X_1, X_2, X_3, \ldots, X_{n-1}}(y_0, y_1 \oplus y_0, y_2 \oplus y_1, \ldots, y_{n-1} \oplus y_{n-2})
  = p_{Y_0}(y_0) \prod_{i=1}^{n-1} p_X(y_i \oplus y_{i-1}).

Used the facts that (1) a ⊕ b = c iff a = b ⊕ c, (2) Y_0, X_1, X_2, \ldots, X_{n-1} are mutually independent, and (3) the X_n are iid.


Plug in specific forms of pY0 and pX ⇒

p_{Y^n}(y^n) = \frac{1}{2} \prod_{i=1}^{n-1} p^{y_i \oplus y_{i-1}} (1 − p)^{1 - (y_i \oplus y_{i-1})}.

Marginal pmfs for Y_n evaluated by summing out the joints (total probability), e.g.,

p_{Y_1}(y_1) = \sum_{y_0} p_{Y_0, Y_1}(y_0, y_1) = \frac{1}{2} \sum_{y_0} p^{y_1 \oplus y_0} (1 − p)^{1 - (y_1 \oplus y_0)} = \frac{1}{2}; \quad y_1 = 0, 1.

In a similar fashion it can be shown that the marginals for Y_n are all the same:

p_{Y_n}(y) = \frac{1}{2}; \quad y = 0, 1;\ n = 0, 1, 2, \ldots


Hence drop subscript and abbreviate pmf to pY

Note: Would not be the same with different initialization, e.g., Y0 = 1

Unlike the iid Xn process

p_{Y^n}(y^n) \neq \prod_{i=0}^{n-1} p_Y(y_i)

(provided p ≠ 1/2)

Yn not iid

The joint is not the product of marginals, but can use the chain rule with conditional probabilities to write it as a product of conditional pmfs, given by

p_{Y_l|Y_0, Y_1, \ldots, Y_{l-1}}(y_l|y_0, y_1, \ldots, y_{l-1}) = \frac{p_{Y^{l+1}}(y^{l+1})}{p_{Y^l}(y^l)} = p_X(y_l \oplus y_{l-1})


Note: The conditional probability of the current output Y_l given the entire past Y_i; i = 0, 1, \ldots, l − 1 depends only on the most recent past output Y_{l-1}! This property can be summarized nicely by also deriving the conditional pmf

p_{Y_l|Y_{l-1}}(y_l|y_{l-1}) = \frac{p_{Y_{l-1}, Y_l}(y_{l-1}, y_l)}{p_{Y_{l-1}}(y_{l-1})} = p^{y_l \oplus y_{l-1}} (1 − p)^{1 - (y_l \oplus y_{l-1})}

⇒ p_{Y_l|Y_0, Y_1, \ldots, Y_{l-1}}(y_l|y_0, y_1, \ldots, y_{l-1}) = p_{Y_l|Y_{l-1}}(y_l|y_{l-1}).

A discrete time random process with this property is called a Markov process or Markov chain

The binary autoregressive process is a Markov process!


The binomial counting process

Next filter binary Bernoulli process using ordinary arithmetic.

X_n iid binary random process with marginal pmf p_X(1) = p = 1 − p_X(0).

Y_n = \begin{cases} Y_0 = 0 & n = 0 \\ \sum_{k=1}^{n} X_k = Y_{n-1} + X_n & n = 1, 2, \ldots \end{cases}

Y_n = output of a discrete time time-invariant linear filter with Kronecker delta response h_k given by h_k = 1 for k ≥ 0 and h_k = 0 otherwise.

By definition,

Yn = Yn−1 or Yn = Yn−1 + 1; n = 2, 3, . . .


A discrete time process with this property is called a counting process. Will later see a continuous time counting process which also can only increase by 1

To completely describe this process need a formula for the joint pmfs

p_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = p_{Y_1}(y_1) \prod_{l=2}^{n} p_{Y_l|Y_1, \ldots, Y_{l-1}}(y_l|y_1, \ldots, y_{l-1})

Already found the marginal pmf p_{Y_n}(k) using transforms to be binomial ⇒ binomial counting process

Find conditional pmfs, which imply joints via chain rule.


p_{Y_n|Y_{n-1}, \ldots, Y_1}(y_n|y_{n-1}, \ldots, y_1)
  = \Pr(Y_n = y_n|Y_l = y_l;\ l = 1, \ldots, n − 1)
  = \Pr(X_n = y_n − y_{n-1}|Y_l = y_l;\ l = 1, \ldots, n − 1)
  = \Pr(X_n = y_n − y_{n-1}|X_1 = y_1, X_i = y_i − y_{i-1};\ i = 2, 3, \ldots, n − 1)

Follows since the conditioning event {Y_i = y_i; i = 1, 2, \ldots, n − 1} is the event {X_1 = y_1, X_i = y_i − y_{i-1}; i = 2, 3, \ldots, n − 1} and, given this event, the event {Y_n = y_n} is the event {X_n = y_n − y_{n-1}}.

Thus

p_{Y_n|Y_{n-1}, \ldots, Y_1}(y_n|y_{n-1}, \ldots, y_1) = p_{X_n|X_{n-1}, \ldots, X_2, X_1}(y_n − y_{n-1}|y_{n-1} − y_{n-2}, \ldots, y_2 − y_1, y_1)


X_n iid ⇒ p_{Y_n|Y_{n-1}, \ldots, Y_1}(y_n|y_{n-1}, \ldots, y_1) = p_X(y_n − y_{n-1})

Hence chain rule + the definition y_0 = 0 ⇒

p_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \prod_{i=1}^{n} p_X(y_i − y_{i-1})

For binomial counting process, use Bernoulli pX:

p_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \prod_{i=1}^{n} p^{(y_i − y_{i-1})} (1 − p)^{1 − (y_i − y_{i-1})},

where y_i − y_{i-1} = 0 or 1, i = 1, 2, \ldots, n; y_0 = 0.
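A simulation sketch checking that the marginal of this counting process is binomial; p, n, and the trial count are assumed toy values:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(4)
p, n, trials = 0.3, 20, 100_000
x = rng.random((trials, n)) < p           # iid Bernoulli(p) increments
yn = x.sum(axis=1)                        # Y_n = X_1 + ... + X_n per trial
for k in (0, 5, 10):
    print(k, np.mean(yn == k), binom.pmf(k, n, p))   # empirical vs binomial pmf
```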


Similar derivation ⇒

p_{Y_n|Y_{n-1}}(y_n|y_{n-1}) = \Pr(Y_n = y_n|Y_{n-1} = y_{n-1}) = \Pr(X_n = y_n − y_{n-1}|Y_{n-1} = y_{n-1}).

The conditioning event depends only on values of X_k for k < n, hence p_{Y_n|Y_{n-1}}(y_n|y_{n-1}) = p_X(y_n − y_{n-1}) ⇒ Y_n is Markov

A similar derivation works for the sum of iid rvs with any pmf p_X to show that

p_{Y_n|Y_{n-1}, \ldots, Y_1}(y_n|y_{n-1}, \ldots, y_1) = p_{Y_n|Y_{n-1}}(y_n|y_{n-1})

or, equivalently,

\Pr(Y_n = y_n|Y_i = y_i;\ i = 1, \ldots, n − 1) = \Pr(Y_n = y_n|Y_{n-1} = y_{n-1}),

⇒ Markov


Discrete random walk

Slight variation: Let X_n be binary iid with alphabet {1, −1} and \Pr(X_n = −1) = p

Y_n = \begin{cases} 0 & n = 0 \\ \sum_{k=1}^{n} X_k & n = 1, 2, \ldots \end{cases}

Also has autoregressive format

Yn = Yn−1 + Xn, n = 1, 2, . . .

Transform of the iid random variables is

M_X(ju) = (1 − p) e^{ju} + p e^{-ju},


binomial theorem ⇒

M_{Y_n}(ju) = ((1 − p) e^{ju} + p e^{-ju})^n
            = \sum_{k=0}^{n} \binom{n}{k} (1 − p)^{n-k} p^k\, e^{ju(n - 2k)}
            = \sum_{k = -n, -n+2, \ldots, n-2, n} \underbrace{\binom{n}{(n-k)/2} (1 − p)^{(n+k)/2} p^{(n-k)/2}}_{p_{Y_n}(k)}\, e^{juk}.


p_{Y_n}(k) = \binom{n}{(n-k)/2} (1 − p)^{(n+k)/2} p^{(n-k)/2}, \quad k = −n, −n+2, \ldots, n−2, n.

Note that Y_n must be even or odd depending on whether n is even or odd. This follows from the nature of the increments.
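A quick comparison of this pmf with a Monte Carlo estimate; p, n, and the k values below are assumed for illustration:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(5)
p, n, trials = 0.4, 9, 200_000
steps = np.where(rng.random((trials, n)) < p, -1, 1)    # -1 w.p. p, +1 w.p. 1-p
yn = steps.sum(axis=1)
for k in (-3, 1, 5):                                     # same parity as n
    pmf = comb(n, (n - k) // 2) * (1 - p) ** ((n + k) / 2) * p ** ((n - k) / 2)
    print(k, np.mean(yn == k), pmf)                      # empirical vs formula
```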


The discrete time Wiener process

X_n iid N(0, σ^2).

As with the counting process, define

Y_n = \begin{cases} 0 & n = 0 \\ \sum_{k=1}^{n} X_k & n = 1, 2, \ldots \end{cases}

discrete time Wiener process

Handle in essentially the same way, but use cdfs and then pdfs

Previously found the marginal f_{Y_n} using transforms to be N(0, n\sigma_X^2)


To find the joint pdfs use conditional pdfs and chain rule

f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \prod_{l=1}^{n} f_{Y_l|Y_1, \ldots, Y_{l-1}}(y_l|y_1, \ldots, y_{l-1}).

To find the conditional pdf f_{Y_n|Y_1, \ldots, Y_{n-1}}(y_n|y_1, \ldots, y_{n-1}), first find the conditional cdf P(Y_n ≤ y_n|Y_{n-i} = y_{n-i}; i = 1, 2, \ldots, n − 1). Analogous to the discrete case:

P(Y_n ≤ y_n|Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots, n − 1)
  = P(X_n ≤ y_n − y_{n-1}|Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots, n − 1)
  = P(X_n ≤ y_n − y_{n-1}) = F_X(y_n − y_{n-1}),


Differentiating the conditional cdf to obtain the conditional pdf ⇒

f_{Y_n|Y_1, \ldots, Y_{n-1}}(y_n|y_1, \ldots, y_{n-1}) = \frac{\partial}{\partial y_n} F_X(y_n − y_{n-1}) = f_X(y_n − y_{n-1}),

pdf chain rule ⇒

f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \prod_{i=1}^{n} f_X(y_i − y_{i-1}).


If f_X = N(0, \sigma^2),

f_{Y^n}(y^n) = \frac{\exp\left(-\frac{y_1^2}{2\sigma^2}\right)}{\sqrt{2\pi\sigma^2}} \prod_{i=2}^{n} \frac{\exp\left(-\frac{(y_i - y_{i-1})^2}{2\sigma^2}\right)}{\sqrt{2\pi\sigma^2}} = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\left(\sum_{i=2}^{n} (y_i - y_{i-1})^2 + y_1^2\right)\right).

This is a joint Gaussian pdf with mean vector 0 and covariance matrix K_X(m, n) = σ^2 \min(m, n), m, n = 1, 2, \ldots
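A Monte Carlo check of this covariance structure; σ and the chosen sample times are assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, n, trials = 1.5, 8, 100_000
Y = np.cumsum(rng.normal(0, sigma, size=(trials, n)), axis=1)   # Y_1, ..., Y_n per trial
m_idx, n_idx = 3, 6                                             # sample times (1-based)
emp = np.mean(Y[:, m_idx - 1] * Y[:, n_idx - 1])                # estimate of E[Y_m Y_n]
print(emp, "should be close to", sigma**2 * min(m_idx, n_idx))
```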

A similar argument implies that

f_{Y_n|Y_{n-1}}(y_n|y_{n-1}) = f_X(y_n − y_{n-1})


and hence

f_{Y_n|Y_1, \ldots, Y_{n-1}}(y_n|y_1, \ldots, y_{n-1}) = f_{Y_n|Y_{n-1}}(y_n|y_{n-1}).

As in the discrete alphabet case, a process with this property is called a Markov process

Combine the discrete alphabet and continuous alphabet definitions into a common definition: a discrete time random process Y_n is said to be a Markov process if the conditional cdfs satisfy the relation

\Pr(Y_n ≤ y_n|Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots) = \Pr(Y_n ≤ y_n|Y_{n-1} = y_{n-1})

for all y_{n-1}, y_{n-2}, \ldots


More specifically, such a Y_n is frequently called a first-order Markov process because it depends only on the most recent past value. An extended definition to nth-order Markov processes can be made in the obvious fashion.
