Understanding the Metropolis-Hastings Algorithm
Jiguang Li
Center for Applied Artificial Intelligence
June 11th, 2021
Motivations: Estimating IRT Models

Formula: Recall a 2-parameter IRT model:

P(X_{ij} = 1 \mid \theta_j, b_i, \alpha_i) = \frac{e^{\alpha_i(\theta_j - b_i)}}{1 + e^{\alpha_i(\theta_j - b_i)}},

where
  X_{ij}: whether student j answers question i correctly
  Student ability: \theta_j
  Item difficulty: b_i
  Item discrimination: \alpha_i > 0

Log-likelihood: Let n be the number of students, k the number of questions, X our data, and \Theta all the parameters:

\log p(X \mid \Theta) = \sum_{i=1}^{k} \sum_{j=1}^{n} \Big[ X_{ij} \log P(X_{ij} = 1 \mid \Theta) + (1 - X_{ij}) \log P(X_{ij} = 0 \mid \Theta) \Big]   (1)
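To make the model concrete, here is a minimal NumPy sketch of this log-likelihood; the array names (theta, b, a, X) and shapes are illustrative choices, not from the talk:

```python
import numpy as np

def irt_2pl_loglik(theta, b, a, X):
    """Log-likelihood of a 2PL IRT model.

    theta : (n,) student abilities
    b, a  : (k,) item difficulties and discriminations (a > 0)
    X     : (k, n) binary responses, X[i, j] = 1 if student j
            answered question i correctly
    """
    # P(X_ij = 1) is the logistic function of a_i * (theta_j - b_i)
    logits = a[:, None] * (theta[None, :] - b[:, None])   # shape (k, n)
    p = 1.0 / (1.0 + np.exp(-logits))
    return np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))
```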
Bayesian Inference

Given data X and parameters \Theta, the posterior distribution can be written as:

p(\Theta \mid X) = \frac{p(X \mid \Theta) \, P(\Theta)}{\int p(X \mid \Theta) \, P(\Theta) \, d\Theta}   (3)

  p(\Theta \mid X): posterior distribution
  p(X \mid \Theta): likelihood
  P(\Theta): prior belief
  \int p(X \mid \Theta) \, P(\Theta) \, d\Theta: hard to compute

Q: How can we sample \Theta from the posterior distribution without computing \int p(X \mid \Theta) \, P(\Theta) \, d\Theta?
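The numerator p(X | Θ)P(Θ) is easy to evaluate pointwise even when the integral is not, and that is all M-H will need. A hedged sketch, reusing irt_2pl_loglik from the sketch above and assuming standard normal priors on the abilities purely for illustration:

```python
import numpy as np
from scipy.stats import norm

def log_unnormalized_posterior(theta, b, a, X):
    """log p(X | Theta) + log P(Theta), i.e. the log posterior up to
    the additive constant -log p(X) that we cannot compute.

    The N(0, 1) priors on the abilities are an assumed, illustrative
    choice; the talk does not fix a particular prior.
    """
    log_prior = np.sum(norm.logpdf(theta))        # assumed N(0, 1) priors
    log_lik = irt_2pl_loglik(theta, b, a, X)      # from the sketch above
    return log_lik + log_prior
```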
The Metropolis-Hastings Algorithm

Recall that p(\Theta \mid X) is hard to sample from. The M-H algorithm says we may instead accept candidates drawn from a proposal distribution Q(\Theta) with some acceptance probability \alpha(\cdot, \cdot).

Algorithm 1: The Metropolis-Hastings Algorithm
Input: an arbitrary value \Theta_0 and the proposal distribution Q(\Theta)
For t = 1, \dots, N:
  Generate a candidate \Theta' from Q(\Theta)
  Draw u \sim \mathrm{Uniform}(0, 1)
  Compute \alpha(\Theta_t, \Theta') = \min\left\{1, \frac{P(\Theta' \mid X) \, Q(\Theta_t)}{P(\Theta_t \mid X) \, Q(\Theta')}\right\}
  If u < \alpha(\Theta_t, \Theta'), set \Theta_{t+1} = \Theta'; else set \Theta_{t+1} = \Theta_t
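A minimal Python sketch of this loop, written for the independence proposal Q(Θ) as stated above; log_target stands for log P(Θ | X) up to an additive constant, and the callables are assumed to be supplied by the user:

```python
import numpy as np

def metropolis_hastings(log_target, proposal_sample, proposal_logpdf,
                        theta0, n_iter, rng=None):
    """Independence-chain M-H, as on the slide.

    log_target      : log P(theta | X) up to an additive constant
    proposal_sample : rng -> one candidate draw from Q
    proposal_logpdf : theta -> log Q(theta)
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = theta0
    chain = [theta0]
    for _ in range(n_iter):
        cand = proposal_sample(rng)
        # log of min{1, P(cand|X) Q(theta) / (P(theta|X) Q(cand))},
        # computed in log space for numerical stability
        log_alpha = min(0.0, (log_target(cand) + proposal_logpdf(theta))
                             - (log_target(theta) + proposal_logpdf(cand)))
        if np.log(rng.uniform()) < log_alpha:
            theta = cand                 # accept the candidate
        chain.append(theta)              # else stay at the current theta
    return np.asarray(chain)
```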
Three Questions

1. Choice of proposal distribution Q(\Theta): not the focus of this talk.

2. Is the acceptance probability \alpha(\Theta_t, \Theta') = \min\left\{1, \frac{P(\Theta' \mid X) \, Q(\Theta_t)}{P(\Theta_t \mid X) \, Q(\Theta')}\right\} easy to compute? Yes: the intractable normalizing constant P(X) cancels in the ratio (see the sketch below):

\frac{P(\Theta' \mid X)}{P(\Theta_t \mid X)} = \frac{P(X \mid \Theta') \, P(\Theta') / P(X)}{P(X \mid \Theta_t) \, P(\Theta_t) / P(X)} = \frac{P(X \mid \Theta') \, P(\Theta')}{P(X \mid \Theta_t) \, P(\Theta_t)}

3. Why does \alpha(\Theta_t, \Theta') work? This is the focus of the talk.
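In code, this cancellation means the acceptance ratio needs only the likelihood and the prior, usually computed in log space to avoid underflow. A hypothetical sketch:

```python
def log_accept_ratio(log_lik, log_prior, theta_cand, theta_curr):
    """log [ P(X|cand) P(cand) / (P(X|curr) P(curr)) ] -- P(X) never appears."""
    return (log_lik(theta_cand) + log_prior(theta_cand)) \
         - (log_lik(theta_curr) + log_prior(theta_curr))
```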
Markov chain review

Consider a Markov chain defined on a continuous state space with a transition kernel P(x, A):

  P(x, A): the probability of moving from the point x \in \mathbb{R}^d to the set A
  P(x, \mathbb{R}^d) = 1
  P(x, \{x\}) is not necessarily 0

Under some conditions, there exists a stationary distribution of this Markov chain:

\pi^*(A) = \int_{\mathbb{R}^d} P(x, A) \, \pi(x) \, dx   (4)

  \pi^*: the stationary distribution
  A: some set in \mathbb{R}^d
  \pi: density of the stationary distribution
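A discrete-state analogue of equation (4) is easy to verify numerically: for a row-stochastic transition matrix P, the stationary distribution is the left eigenvector with eigenvalue 1 and satisfies πP = π. A small sketch with an arbitrary 3-state chain:

```python
import numpy as np

# An arbitrary 3-state transition matrix (rows sum to 1)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Stationary distribution: left eigenvector of P with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()

print(pi, np.allclose(pi @ P, pi))   # pi P = pi: the discrete analogue of (4)
```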
How does the Markov chain relate to P(\Theta \mid X)?

The M-H algorithm simulates a random walk on a Markov chain with some transition kernel P(x, y) whose stationary distribution is P(\Theta \mid X). Each step of the walk is accepted with some probability \alpha(\cdot, \cdot).

Question: How do we construct the transition kernel P(x, y) so that the Markov chain converges to the posterior?
"The Transition Kernel": P(x, A)

Consider the following definition:

P(x, A) = \int_A p(x, y) \, dy + r(x) \, \delta_x(A)   (5)

  p(x, y): some function such that p(x, x) = 0
  \delta_x(A) = 1 if x \in A, and 0 otherwise
  r(x) = 1 - \int_{\mathbb{R}^d} p(x, y) \, dy: the probability that the chain stays at x
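To make equation (5) concrete, here is a hedged numerical sketch: an illustrative sub-stochastic density p(x, y), the resulting staying probability r(x), and the kernel P(x, A) evaluated on an interval by quadrature. All particular choices here are assumptions, not from the talk:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative choice: move according to N(y; x, 1) with total probability
# 0.5, stay at x otherwise. (The condition p(x, x) = 0 is a measure-zero
# convention and does not affect any of the integrals below.)
def p(x, y):
    return 0.5 * norm.pdf(y, loc=x, scale=1.0)

def r(x):
    # r(x) = 1 - \int p(x, y) dy  (= 0.5 for this choice of p)
    return 1.0 - quad(lambda y: p(x, y), -np.inf, np.inf)[0]

def kernel(x, a, b):
    # P(x, A) for the interval A = [a, b], per equation (5)
    move = quad(lambda y: p(x, y), a, b)[0]
    stay = r(x) if a <= x <= b else 0.0    # r(x) * delta_x(A)
    return move + stay

print(kernel(0.0, -1.0, 1.0))   # probability of being in [-1, 1] after one step
```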
Proof that the transition kernel works

Recall our kernel: P(x, A) = \int_A p(x, y) \, dy + r(x) \, \delta_x(A)
Detailed balance: \pi(x) \, p(x, y) = \pi(y) \, p(y, x)

\begin{aligned}
\int P(x, A) \, \pi(x) \, dx
&= \int \left[ \int_A p(x, y) \, dy \right] \pi(x) \, dx + \int r(x) \, \delta_x(A) \, \pi(x) \, dx \\
&= \int_A \left[ \int p(x, y) \, \pi(x) \, dx \right] dy + \int r(x) \, \delta_x(A) \, \pi(x) \, dx \\
&= \int_A \left[ \int p(y, x) \, \pi(y) \, dx \right] dy + \int r(x) \, \delta_x(A) \, \pi(x) \, dx && \text{(detailed balance)} \\
&= \int_A (1 - r(y)) \, \pi(y) \, dy + \int_A r(x) \, \pi(x) \, dx \\
&= \int_A \pi(y) \, dy = \pi^*(A)
\end{aligned}   (6)

So if p(x, y) satisfies detailed balance with respect to \pi, then \pi is the stationary density of the kernel P(x, A).
Finding p(x, y)

Consider a candidate-generating density q(y \mid x), such that \int q(y \mid x) \, dy = 1. We will be done if we have:

\pi(x) \, q(y \mid x) = \pi(y) \, q(x \mid y)

Consider the case \pi(x) \, q(y \mid x) > \pi(y) \, q(x \mid y): we can multiply the left-hand side by a term \alpha(x, y) \le 1:

\pi(x) \, q(y \mid x) \, \alpha(x, y) = \pi(y) \, q(x \mid y) \implies \alpha(x, y) = \frac{\pi(y) \, q(x \mid y)}{\pi(x) \, q(y \mid x)}

In the case \pi(x) \, q(y \mid x) \le \pi(y) \, q(x \mid y), we can let \alpha(x, y) = 1 and introduce \alpha(y, x) on the right-hand side.
p(x, y) just found!

If we let \alpha(x, y) = \min\left\{1, \frac{\pi(y) \, q(x \mid y)}{\pi(x) \, q(y \mid x)}\right\}, we can define

p(x, y) = \alpha(x, y) \, q(y \mid x)

Then the detailed balance equation is satisfied:

\pi(x) \, p(x, y) = \pi(y) \, p(y, x)
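Detailed balance can be sanity-checked numerically for this p(x, y); the standard normal target and Gaussian proposal below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

pi = lambda x: norm.pdf(x, 0.0, 1.0)            # target density (illustrative)
q = lambda y, x: norm.pdf(y, loc=x, scale=0.7)  # proposal q(y | x); sd 0.7 assumed

def alpha(x, y):
    return min(1.0, pi(y) * q(x, y) / (pi(x) * q(y, x)))

def p(x, y):
    return alpha(x, y) * q(y, x)                # p(x, y) = alpha(x, y) q(y | x)

x, y = 0.3, -1.2                                # arbitrary test points
print(np.isclose(pi(x) * p(x, y), pi(y) * p(y, x)))   # True: detailed balance
```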
Summary

Motivation: We have trouble sampling \Theta from the posterior P(\Theta \mid X).
Step 1: We look for a transition kernel P(x, A) of a Markov chain whose stationary distribution is P(\Theta \mid X).
Step 2: We consider the transition kernel P(x, A) = \int_A p(x, y) \, dy + r(x) \, \delta_x(A).
Step 3: We showed the kernel works if p(x, y) fulfills detailed balance: \pi(x) \, p(x, y) = \pi(y) \, p(y, x).
Step 4: We showed the function p(x, y) = \alpha(x, y) \, q(y \mid x) satisfies detailed balance.
That is: if we sample from q(y \mid x) and accept with probability \alpha(x, y), then we will (eventually) be sampling from the posterior distribution. An end-to-end example follows below.
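As an end-to-end illustration (an assumed example, not from the talk): a random-walk M-H sampler for a coin's bias, where the symmetric proposal makes the q terms cancel in α, and the posterior is known in closed form for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: 7 heads in 10 flips; Uniform(0, 1) prior => posterior is Beta(8, 4)
heads, n = 7, 10

def log_post(p):            # log likelihood + log prior, up to a constant
    if not 0.0 < p < 1.0:
        return -np.inf
    return heads * np.log(p) + (n - heads) * np.log(1.0 - p)

# Random-walk M-H: the Gaussian proposal is symmetric, so q cancels in alpha
p_curr, chain = 0.5, []
for _ in range(50_000):
    p_cand = p_curr + rng.normal(0, 0.1)
    if np.log(rng.uniform()) < log_post(p_cand) - log_post(p_curr):
        p_curr = p_cand
    chain.append(p_curr)

samples = np.array(chain[5_000:])     # drop burn-in
print(samples.mean())                 # ~ 8 / (8 + 4) = 0.667
```

Note that the sampler never evaluates the normalizing constant; only the unnormalized log posterior enters the acceptance step.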
Reference

S. Chib and E. Greenberg (1995). "Understanding the Metropolis-Hastings Algorithm." The American Statistician, 49(4), 327-335.