Understanding the Metropolis-Hastings Algorithm
Jiguang Li
Center for Applied Artificial Intelligence
June 11th, 2021
Motivations: Estimating IRT Models

Formula: Recall a 2-parameter IRT model:

P(X_{ij} = 1 \mid \theta_j, b_i, \alpha_i) = \frac{e^{\alpha_i(\theta_j - b_i)}}{1 + e^{\alpha_i(\theta_j - b_i)}},

where
  X_{ij}: whether student j answers question i correctly
  Student ability: \theta_j
  Item difficulty: b_i
  Item discrimination: \alpha_i > 0

Log-likelihood: Let n be the number of students, k the number of questions, X our data, and \Theta all the parameters:

\log p(X \mid \Theta) = \sum_{i=1}^{k} \sum_{j=1}^{n} \Big[ X_{ij} \log P(X_{ij} = 1 \mid \Theta) + (1 - X_{ij}) \log P(X_{ij} = 0 \mid \Theta) \Big]   (1)
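To make the model concrete, here is a minimal NumPy sketch of this log-likelihood; the array names (theta, b, a, X) and shapes are illustrative choices, not from the talk:

```python
import numpy as np

def irt_2pl_loglik(theta, b, a, X):
    """Log-likelihood of a 2PL IRT model.

    theta : (n,) student abilities
    b, a  : (k,) item difficulties and discriminations (a > 0)
    X     : (k, n) binary responses, X[i, j] = 1 if student j
            answered question i correctly
    """
    # P(X_ij = 1) is the logistic function of a_i * (theta_j - b_i)
    logits = a[:, None] * (theta[None, :] - b[:, None])   # shape (k, n)
    p = 1.0 / (1.0 + np.exp(-logits))
    return np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))
```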
Bayesian Inference

Given data X and parameters \Theta, the posterior distribution can be written as:

p(\Theta \mid X) = \frac{p(X \mid \Theta) \, P(\Theta)}{\int p(X \mid \Theta) \, P(\Theta) \, d\Theta}   (3)

  p(\Theta \mid X): posterior distribution
  p(X \mid \Theta): likelihood
  P(\Theta): prior belief
  \int p(X \mid \Theta) \, P(\Theta) \, d\Theta: hard to compute

Q: How can we sample \Theta from the posterior distribution without computing \int p(X \mid \Theta) \, P(\Theta) \, d\Theta?
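The numerator p(X | Θ)P(Θ) is easy to evaluate pointwise even when the integral is not, and that is all M-H will need. A hedged sketch, reusing irt_2pl_loglik from the sketch above and assuming standard normal priors on the abilities purely for illustration:

```python
import numpy as np
from scipy.stats import norm

def log_unnormalized_posterior(theta, b, a, X):
    """log p(X | Theta) + log P(Theta), i.e. the log posterior up to
    the additive constant -log p(X) that we cannot compute.

    The N(0, 1) priors on the abilities are an assumed, illustrative
    choice; the talk does not fix a particular prior.
    """
    log_prior = np.sum(norm.logpdf(theta))        # assumed N(0, 1) priors
    log_lik = irt_2pl_loglik(theta, b, a, X)      # from the sketch above
    return log_lik + log_prior
```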
The Metropolis-Hastings Algorithm

Recall that p(\Theta \mid X) is hard to sample from. The M-H algorithm says we may instead accept candidates drawn from a proposal distribution Q(\Theta) with some acceptance probability \alpha(\cdot, \cdot).

Algorithm 1: The Metropolis-Hastings Algorithm
Input: an arbitrary value \Theta_0 and the proposal distribution Q(\Theta)
For t = 1, \dots, N:
  Generate a candidate \Theta' from Q(\Theta)
  Draw u \sim \mathrm{Uniform}(0, 1)
  Compute \alpha(\Theta_t, \Theta') = \min\left\{1, \frac{P(\Theta' \mid X) \, Q(\Theta_t)}{P(\Theta_t \mid X) \, Q(\Theta')}\right\}
  If u < \alpha(\Theta_t, \Theta'), set \Theta_{t+1} = \Theta'; else set \Theta_{t+1} = \Theta_t
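A minimal Python sketch of this loop, written for the independence proposal Q(Θ) as stated above; log_target stands for log P(Θ | X) up to an additive constant, and the callables are assumed to be supplied by the user:

```python
import numpy as np

def metropolis_hastings(log_target, proposal_sample, proposal_logpdf,
                        theta0, n_iter, rng=None):
    """Independence-chain M-H, as on the slide.

    log_target      : log P(theta | X) up to an additive constant
    proposal_sample : rng -> one candidate draw from Q
    proposal_logpdf : theta -> log Q(theta)
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = theta0
    chain = [theta0]
    for _ in range(n_iter):
        cand = proposal_sample(rng)
        # log of min{1, P(cand|X) Q(theta) / (P(theta|X) Q(cand))},
        # computed in log space for numerical stability
        log_alpha = min(0.0, (log_target(cand) + proposal_logpdf(theta))
                             - (log_target(theta) + proposal_logpdf(cand)))
        if np.log(rng.uniform()) < log_alpha:
            theta = cand                 # accept the candidate
        chain.append(theta)              # else stay at the current theta
    return np.asarray(chain)
```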
Three Questions

1. Choice of proposal distribution Q(\Theta): not the focus of this talk.

2. Is the acceptance probability \alpha(\Theta_t, \Theta') = \min\left\{1, \frac{P(\Theta' \mid X) \, Q(\Theta_t)}{P(\Theta_t \mid X) \, Q(\Theta')}\right\} easy to compute? Yes: the intractable normalizing constant P(X) cancels in the ratio (see the sketch below):

\frac{P(\Theta' \mid X)}{P(\Theta_t \mid X)} = \frac{P(X \mid \Theta') \, P(\Theta') / P(X)}{P(X \mid \Theta_t) \, P(\Theta_t) / P(X)} = \frac{P(X \mid \Theta') \, P(\Theta')}{P(X \mid \Theta_t) \, P(\Theta_t)}

3. Why does \alpha(\Theta_t, \Theta') work? This is the focus of the talk.
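In code, this cancellation means the acceptance ratio needs only the likelihood and the prior, usually computed in log space to avoid underflow. A hypothetical sketch:

```python
def log_accept_ratio(log_lik, log_prior, theta_cand, theta_curr):
    """log [ P(X|cand) P(cand) / (P(X|curr) P(curr)) ] -- P(X) never appears."""
    return (log_lik(theta_cand) + log_prior(theta_cand)) \
         - (log_lik(theta_curr) + log_prior(theta_curr))
```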
Markov chain review

Consider a Markov chain defined on a continuous state space with a transition kernel P(x, A):

  P(x, A): the probability of moving from the point x \in \mathbb{R}^d to the set A
  P(x, \mathbb{R}^d) = 1
  P(x, \{x\}) is not necessarily 0

Under some conditions, there exists a stationary distribution of this Markov chain:

\pi^*(A) = \int_{\mathbb{R}^d} P(x, A) \, \pi(x) \, dx   (4)

  \pi^*: the stationary distribution
  A: some set in \mathbb{R}^d
  \pi: density of the stationary distribution
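A discrete-state analogue of equation (4) is easy to verify numerically: for a row-stochastic transition matrix P, the stationary distribution is the left eigenvector with eigenvalue 1 and satisfies πP = π. A small sketch with an arbitrary 3-state chain:

```python
import numpy as np

# An arbitrary 3-state transition matrix (rows sum to 1)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Stationary distribution: left eigenvector of P with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()

print(pi, np.allclose(pi @ P, pi))   # pi P = pi: the discrete analogue of (4)
```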
How does the Markov chain relate to P(\Theta \mid X)?

The M-H algorithm simulates a random walk on a Markov chain with some transition kernel P(x, y) whose stationary distribution is P(\Theta \mid X). Each step of the walk is accepted with some probability \alpha(\cdot, \cdot).

Question: How do we construct the transition kernel P(x, y) so that the Markov chain converges to the posterior?
"The Transition Kernel": P(x, A)

Consider the following definition:

P(x, A) = \int_A p(x, y) \, dy + r(x) \, \delta_x(A)   (5)

  p(x, y): some function such that p(x, x) = 0
  \delta_x(A) = 1 if x \in A, and 0 otherwise
  r(x) = 1 - \int_{\mathbb{R}^d} p(x, y) \, dy: the probability that the chain stays at x
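To make equation (5) concrete, here is a hedged numerical sketch: an illustrative sub-stochastic density p(x, y), the resulting staying probability r(x), and the kernel P(x, A) evaluated on an interval by quadrature. All particular choices here are assumptions, not from the talk:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative choice: move according to N(y; x, 1) with total probability
# 0.5, stay at x otherwise. (The condition p(x, x) = 0 is a measure-zero
# convention and does not affect any of the integrals below.)
def p(x, y):
    return 0.5 * norm.pdf(y, loc=x, scale=1.0)

def r(x):
    # r(x) = 1 - \int p(x, y) dy  (= 0.5 for this choice of p)
    return 1.0 - quad(lambda y: p(x, y), -np.inf, np.inf)[0]

def kernel(x, a, b):
    # P(x, A) for the interval A = [a, b], per equation (5)
    move = quad(lambda y: p(x, y), a, b)[0]
    stay = r(x) if a <= x <= b else 0.0    # r(x) * delta_x(A)
    return move + stay

print(kernel(0.0, -1.0, 1.0))   # probability of being in [-1, 1] after one step
```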
Proof that the transition kernel works

Recall our kernel: P(x, A) = \int_A p(x, y) \, dy + r(x) \, \delta_x(A)
Detailed balance: \pi(x) \, p(x, y) = \pi(y) \, p(y, x)

\begin{aligned}
\int P(x, A) \, \pi(x) \, dx
&= \int \left[ \int_A p(x, y) \, dy \right] \pi(x) \, dx + \int r(x) \, \delta_x(A) \, \pi(x) \, dx \\
&= \int_A \left[ \int p(x, y) \, \pi(x) \, dx \right] dy + \int r(x) \, \delta_x(A) \, \pi(x) \, dx \\
&= \int_A \left[ \int p(y, x) \, \pi(y) \, dx \right] dy + \int r(x) \, \delta_x(A) \, \pi(x) \, dx && \text{(detailed balance)} \\
&= \int_A (1 - r(y)) \, \pi(y) \, dy + \int_A r(x) \, \pi(x) \, dx \\
&= \int_A \pi(y) \, dy = \pi^*(A)
\end{aligned}   (6)

So if p(x, y) satisfies detailed balance with respect to \pi, then \pi is the stationary density of the kernel P(x, A).
Finding p(x, y)

Consider a candidate-generating density q(y \mid x), such that \int q(y \mid x) \, dy = 1. We will be done if we have:

\pi(x) \, q(y \mid x) = \pi(y) \, q(x \mid y)

Consider the case \pi(x) \, q(y \mid x) > \pi(y) \, q(x \mid y): we can multiply the left-hand side by a term \alpha(x, y) \le 1:

\pi(x) \, q(y \mid x) \, \alpha(x, y) = \pi(y) \, q(x \mid y) \implies \alpha(x, y) = \frac{\pi(y) \, q(x \mid y)}{\pi(x) \, q(y \mid x)}

In the case \pi(x) \, q(y \mid x) \le \pi(y) \, q(x \mid y), we can let \alpha(x, y) = 1 and introduce \alpha(y, x) on the right-hand side.
p(x, y) just found!

If we let \alpha(x, y) = \min\left\{1, \frac{\pi(y) \, q(x \mid y)}{\pi(x) \, q(y \mid x)}\right\}, we can define

p(x, y) = \alpha(x, y) \, q(y \mid x)

Then the detailed balance equation is satisfied:

\pi(x) \, p(x, y) = \pi(y) \, p(y, x)
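Detailed balance can be sanity-checked numerically for this p(x, y); the standard normal target and Gaussian proposal below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

pi = lambda x: norm.pdf(x, 0.0, 1.0)            # target density (illustrative)
q = lambda y, x: norm.pdf(y, loc=x, scale=0.7)  # proposal q(y | x); sd 0.7 assumed

def alpha(x, y):
    return min(1.0, pi(y) * q(x, y) / (pi(x) * q(y, x)))

def p(x, y):
    return alpha(x, y) * q(y, x)                # p(x, y) = alpha(x, y) q(y | x)

x, y = 0.3, -1.2                                # arbitrary test points
print(np.isclose(pi(x) * p(x, y), pi(y) * p(y, x)))   # True: detailed balance
```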
Summary

Motivation: We have trouble sampling \Theta from the posterior P(\Theta \mid X).
Step 1: We look for a transition kernel P(x, A) of a Markov chain whose stationary distribution is P(\Theta \mid X).
Step 2: We consider the transition kernel P(x, A) = \int_A p(x, y) \, dy + r(x) \, \delta_x(A).
Step 3: We showed the kernel works if p(x, y) fulfills detailed balance: \pi(x) \, p(x, y) = \pi(y) \, p(y, x).
Step 4: We showed the function p(x, y) = \alpha(x, y) \, q(y \mid x) satisfies detailed balance.
That is: if we sample from q(y \mid x) and accept with probability \alpha(x, y), then we will (eventually) be sampling from the posterior distribution. An end-to-end example follows below.
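As an end-to-end illustration (an assumed example, not from the talk): a random-walk M-H sampler for a coin's bias, where the symmetric proposal makes the q terms cancel in α, and the posterior is known in closed form for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: 7 heads in 10 flips; Uniform(0, 1) prior => posterior is Beta(8, 4)
heads, n = 7, 10

def log_post(p):            # log likelihood + log prior, up to a constant
    if not 0.0 < p < 1.0:
        return -np.inf
    return heads * np.log(p) + (n - heads) * np.log(1.0 - p)

# Random-walk M-H: the Gaussian proposal is symmetric, so q cancels in alpha
p_curr, chain = 0.5, []
for _ in range(50_000):
    p_cand = p_curr + rng.normal(0, 0.1)
    if np.log(rng.uniform()) < log_post(p_cand) - log_post(p_curr):
        p_curr = p_cand
    chain.append(p_curr)

samples = np.array(chain[5_000:])     # drop burn-in
print(samples.mean())                 # ~ 8 / (8 + 4) = 0.667
```

Note that the sampler never evaluates the normalizing constant; only the unnormalized log posterior enters the acceptance step.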
Reference

S. Chib and E. Greenberg (1995). "Understanding the Metropolis-Hastings Algorithm." The American Statistician, 49(4), 327-335.