All the notes below are from Prof. David M. Blei's paper: Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models
A model consists of three types of variables:
Formally:
A mixture model assumes that data are clustered and each data point is drawn from a distribution associated with its assigned cluster. The hidden variables of the model are the cluster assignments and parameters to the per-cluster distributions.
One example is the Gaussian mixture model. The variables of the model are $K$ mixture components $\mu_{1:K}$, each of which is the mean of a Gaussian distribution, and a set of mixture proportions $\theta$, a nonnegative $K$-vector that sums to one. Data arise by first choosing a component assignment $z_n$ (an index from 1 to $K$) from the mixture proportions and then drawing the data point $x_n$ from the corresponding Gaussian, $x_n | z_n \sim N(\mu_{z_n}, 1)$. The hidden variables for this model are the mixture assignments for each data point $z_{1:N}$, the set of mixture component means $\mu_{1:K}$, and the mixture proportions $\theta$. To complete the model, we place distributions (priors) on the mixture components (e.g., Gaussian) and the mixture proportions (e.g., Dirichlet). The fixed hyperparameters $\eta$ are the parameters to these distributions.
The Generative Process for the Gaussian Mixture Model:
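A minimal NumPy sketch of this generative process (the specific values of $K$, $N$, $\sigma_0^2$, and the Dirichlet parameter $\alpha$ are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

K, N = 3, 100          # number of components and data points (assumed)
alpha = np.ones(K)     # Dirichlet hyperparameter (assumed)
sigma0_sq = 10.0       # prior variance of the component means (assumed)

# Global hidden variables: mixture proportions and component means.
theta = rng.dirichlet(alpha)                       # theta ~ Dirichlet(alpha)
mu = rng.normal(0.0, np.sqrt(sigma0_sq), size=K)   # mu_k ~ N(0, sigma0^2)

# Local hidden variables and observations.
z = rng.choice(K, size=N, p=theta)                 # z_n ~ Categorical(theta)
x = rng.normal(mu[z], 1.0)                         # x_n | z_n ~ N(mu_{z_n}, 1)
```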
Parameter Classifications:
The traditional way of representing a model is with the factored joint distribution of its hidden and observed variables. For the Gaussian mixture case, the joint distribution is $$p(\theta, \mu, z, x | \sigma_0^2, \alpha)= p(\theta | \alpha) \prod_{k=1}^{K} p(\mu_k | \sigma_0^2) \prod_{i=1}^{N} \big(p(z_i | \theta)\, p(x_i | z_i, \mu)\big)$$
In particular:
Observe: local variables have terms inside the product over N data points, whereas global variables have terms outside of the product.
We can use the joint distribution above to calculate the posterior for the Gaussian mixture model: $$p(\theta, \mu, z | x, \sigma_0^2, \alpha)= \frac{p(\theta, \mu, z, x | \sigma_0^2, \alpha)}{p(x | \sigma_0^2, \alpha)}$$
We call the denominator of the equation above the marginal probability of the data, also known as the evidence. For many applications, the evidence is hard to compute. Developing approximations to the posterior is the focus of modern Bayesian statistics.
To make predictions, we can use the posterior (over the global variables $\theta$ and $\mu$) to form the posterior predictive distribution of future data: $$p(x_{\text{new}} | x, \sigma_0^2, \alpha)= \int \Big(\sum_{z_{\text{new}}} p(z_{\text{new}}| \theta)\, p(x_{\text{new}} |z_{\text{new}}, \mu)\Big)\, p(\theta, \mu | x, \sigma_{0}^2, \alpha)\, d\theta\, d\mu$$
Note the inner sum marginalizes out the local hidden variables for the new data point, conditional on the global hidden variables. The outer integral marginalizes out the global hidden variables, conditional on the data.
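For the Gaussian mixture, this can be approximated by Monte Carlo once we have draws from (an approximation of) the posterior over the global variables. A hedged sketch, where `theta_samples` and `mu_samples` are assumed to hold $S$ posterior draws obtained elsewhere (e.g., from MCMC or a fitted variational approximation):

```python
import numpy as np
from scipy.stats import norm

def log_predictive(x_new, theta_samples, mu_samples):
    """Monte Carlo estimate of log p(x_new | x) for the unit-variance GMM.

    theta_samples: (S, K) posterior draws of the mixture proportions.
    mu_samples:    (S, K) posterior draws of the component means.
    """
    # Inner sum over z_new: sum_k theta_k * N(x_new | mu_k, 1), for each draw.
    per_draw = np.sum(theta_samples * norm.pdf(x_new, loc=mu_samples, scale=1.0), axis=1)
    # Outer integral over the global variables: average over the S posterior draws.
    return np.log(np.mean(per_draw))
```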
Formally, a graphical model represents the family of distributions that respects the independencies it implies:
In a factor model, there is a set of hidden components, and each data point is associated with a hidden vector of weights, with one weight for each component.
Suppose the data are $p$-dimensional. A linear factor model contains $K$ components $\mu_{1:K}$, each of which is a $p$-vector. We can organize the components into a $K \times p$ matrix $\mu$. For each data point $n$, we draw a $K$-dimensional weight vector $w_n$ and then draw the data point from a distribution parameterized by $w_n^{T} \mu$.
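A minimal generative sketch of such a linear factor model with Gaussian weights and Gaussian noise (the dimensions and the unit noise variance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

N, K, p = 100, 5, 20                 # data points, components, dimensionality (assumed)
mu = rng.normal(size=(K, p))         # components, organized as a K x p matrix
W = rng.normal(size=(N, K))          # w_n: per-data-point weight vectors
X = rng.normal(W @ mu, 1.0)          # x_n ~ N(w_n^T mu, I), unit noise assumed
```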
In a mixed-membership model, each group is drawn from a mixture, in which the mixture proportions are unique to the group and the mixture components are shared across groups.
Mixed-membership models posit a set of global mixture components. Each data group arises when we first choose a set of mixture proportions and then, for each observation in the group, choose a mixture assignment from the per-group proportions and the data point from its corresponding component. The groups share the same set of components, but each exhibits them with a different proportion.
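A hedged sketch of this grouped generative process with one-dimensional Gaussian components (the numbers of groups and observations and the hyperparameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

K, G, M = 4, 10, 50                      # components, groups, observations per group (assumed)
mu = rng.normal(0.0, 3.0, size=K)        # global mixture components, shared by all groups
alpha = np.ones(K)                       # Dirichlet hyperparameter (assumed)

groups = []
for g in range(G):
    theta_g = rng.dirichlet(alpha)             # per-group mixture proportions
    z_g = rng.choice(K, size=M, p=theta_g)     # per-observation mixture assignments
    x_g = rng.normal(mu[z_g], 1.0)             # observations drawn from the shared components
    groups.append(x_g)
```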
In figure b above:
Recall the Netflix challenge, in which each row is a user and each column is a movie ...
A matrix factorization model uses hidden variables to embed both the rows and the columns in a low-dimensional space. Each observed cell is modeled by a distribution whose parameters are a linear combination of the row embedding and the column embedding. That is, cells in a single row share the same row embedding but are governed by different column embeddings.
In figure c above: the overlapping plates show how the observations in the $n$th row share its embedding $w_n$ but use different variables $\gamma_m$ for each column.
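A minimal sketch of a Gaussian matrix factorization of this form (the matrix size, embedding dimension, and unit noise variance are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, K = 50, 40, 5                # rows (users), columns (movies), embedding size (assumed)
W = rng.normal(size=(N, K))        # w_n: row embeddings
Gamma = rng.normal(size=(M, K))    # gamma_m: column embeddings
Y = rng.normal(W @ Gamma.T, 1.0)   # y_{nm} ~ N(w_n^T gamma_m, 1); row n shares w_n across its cells
```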
In an HMM, each observation of the time series is drawn from an unobserved mixture component, and that component is drawn conditional on the previous observation's component. The posterior is similar to that of a mixture model, but one in which the model assumes Markovian structure on the sequence of mixture assignments.
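A minimal sketch of the HMM generative process with Gaussian emissions (the number of states, the transition matrix, and the emission means are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 3, 200                           # hidden states and time steps (assumed)
pi0 = np.ones(K) / K                    # initial state distribution (assumed uniform)
A = rng.dirichlet(np.ones(K), size=K)   # transition matrix: A[j] = p(z_t | z_{t-1} = j)
mu = np.array([-2.0, 0.0, 2.0])         # per-state emission means (assumed)

z = np.empty(T, dtype=int)
x = np.empty(T)
z[0] = rng.choice(K, p=pi0)
x[0] = rng.normal(mu[z[0]], 1.0)
for t in range(1, T):
    z[t] = rng.choice(K, p=A[z[t - 1]])   # component drawn conditional on the previous component
    x[t] = rng.normal(mu[z[t]], 1.0)      # observation drawn from that component
```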
In figure d above:
Three common posterior approximation algorithms:
To describe conditionally conjugate models, let $x = x_{1:N}$ be observations, $\beta$ global latent variables, $z = z_{1:N}$ local latent variables, and $\eta$ fixed parameters. We assume that the joint distribution factorizes as
\begin{equation*} p(\beta, z, x | \eta) = p(\beta | \eta) \prod_{n=1}^{N} p(z_n | \beta)p(x_n | z_n, \beta) \label{eq:4} \tag{1} \end{equation*}The posterior inference problem is to compute the conditional distribution of the latent variables given the data:
\begin{equation*} p(\beta, z | x) = \frac{p(\beta , z, x)}{\int p(\beta, z, x) dz d\beta} \label{eq:5} \tag{2} \end{equation*}Recall that the posterior is essential because (1) it lets us examine the hidden structure that likely generated the data and (2) it is the gateway to prediction about new data. The difficulty in computing the posterior stems from the denominator $p(x)$, the marginal probability of the data. For example, to compute this quantity in the mixture model, we must marginalize out every possible combination of assignments of the data points to mixture components. (There are exponentially many such combinations.) Hence we must approximate the posterior.
To complete our definitions, we assume that each complete conditional is in the exponential family. Recall that many common distributions are in the exponential family: Gaussian, multinomial/categorical, Poisson, gamma, Bernoulli, Dirichlet, beta. Also note that if the distribution of $x$ is in an exponential family, then its density has the form:
\begin{equation*} p(x | \eta) = h(x) \exp \{ \eta^{T}t(x) - \alpha(\eta) \} \label{eq:6} \tag{3} \end{equation*}where
It follows that the complete conditional for the global variable is
\begin{equation*} p(\beta | x, z) = h(\beta) \exp \{ \eta_{g}(x, z)^{T}t(\beta) - \alpha(\eta_{g} (x, z)) \} \label{eq:7} \tag{4} \end{equation*}The complete conditional for the local variable is
\begin{equation*} p(z_n | x_n, \beta) = h(z_n) \exp \{ \eta_{l}(x_n, \beta)^{T}t(z_n) - \alpha(\eta_{l} (x_n, \beta)) \} \label{eq:8} \tag{5} \end{equation*}Side notes: Requiring complete conditionals in the exponential family is less stringent than requiring full conjugacy. In contrast, consider the classical Bayesian modeling setting, in which a latent variable is used as a parameter to the observations. The model is fully conjugate if the conditional distribution of the latent variable is in the same family as its prior, a property that depends on both the prior and the data-generating distribution. For example, if $\beta$ is drawn from a Dirichlet and $x_{1:N}$ are drawn from a multinomial with parameter $\beta$, then the conditional distribution of $\beta$ remains in the Dirichlet family. In complex latent variable models, however, the local variables prevent us from choosing priors to make the global variables conjugate to the observations, which is why we resort to approximate inference. However, conditioned on both the local variables and observations we can often choose prior/likelihood pairs that leave the complete conditional in the same family as the prior. Continuing with mixtures, we can use any common prior/likelihood pair (e.g., gamma/Poisson, Gaussian/Gaussian, Dirichlet/multinomial) to build an appropriate conditionally conjugate mixture model.
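For instance, the Dirichlet/multinomial pair from the mixture above gives a concrete complete conditional. Conditional on the assignments $z_{1:N}$ (writing $\alpha$ for a symmetric Dirichlet hyperparameter and $n_k = \sum_{n=1}^{N} \mathbb{1}[z_n = k]$ for the number of points assigned to component $k$), $$p(\theta | z, x, \alpha) \;\propto\; p(\theta | \alpha) \prod_{n=1}^{N} p(z_n | \theta) \;\propto\; \prod_{k=1}^{K} \theta_k^{\alpha + n_k - 1},$$ that is, $\theta | z \sim \text{Dirichlet}(\alpha + n_1, \dots, \alpha + n_K)$, which remains in the family of the prior.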
Mean field inference is a fast and effective method for obtaining approximate posteriors. The idea is to posit a family of distributions over the latent variables with free parameters (termed variational parameters) and then fit those parameters to find the member of the family that is closest to the posterior; closeness is measured by Kullback–Leibler (KL) divergence.
The variational objective function
We denote the variational family over the latent variables by $q(\beta, z | v)$, where $v$ are the free variational parameters that index the family. The goal of variational inference is to find the optimal variational parameters by solving
\begin{equation*} v^{*} = \text{argmin}_{v} \text{KL} (q(\beta, z | v) \,\|\, p(\beta, z | x)) \label{eq:9} \tag{6} \end{equation*}Unfortunately, computing the KL divergence implicitly requires computing $p(x)$, which is difficult. Recall that for continuous distributions, $$ \text{KL} (p \,\|\, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)}\, dx$$
In this case we can rewrite the inner part of equation (6) as
\begin{align} \int_{ \beta, z} q(\beta ,z | v) \log (\frac{q (\beta, z | v)}{p (\beta ,z | x)}) d\beta dz &= \int_{ \beta, z} q(\beta ,z | v) \log (q (\beta, z | v)) d\beta dz \\ & - \int_{ \beta, z} q(\beta ,z | v) \log (p (\beta, z | x)) d\beta dz \\ & = \mathbb E \log(q(\beta, z | v)) - \int_{ \beta, z} q(\beta ,z | v) \log (\frac{p (\beta, z , x)}{p(x)}) d\beta dz \\ & = \mathbb E \log(q(\beta, z | v))- \mathbb E \log(p(\beta, z, x | \eta)) + \log(p(x)) \end{align}Hence variational inference optimizes (maximizes) the related objective function
\begin{equation*} \mathcal{L} (v) = \mathbb E \log(p(\beta, z, x | \eta)) - \mathbb E \log(q(\beta, z | v)) \label{eq:10} \tag{7} \end{equation*}where all expectations are taken with respect to the variational distribution. Intuitively, the first term values variational distributions that place mass on latent variable configurations that make the data likely; the second term, which is the entropy of the variational distribution, values diffuse variational distributions.
The Mean Field Variational Family
Before optimizing the objective, we must specify the variational family, in which each latent variable is independent and governed by its own variational parameter. Let the variational parameters be $v = \{\lambda, \phi_{1:N}\}$, where $\lambda$ is a parameter to the global variable and $\phi_{1:N}$ are parameters to the local variables. The mean field family is \begin{equation*} q(\beta, z | v) = q(\beta | \lambda) \prod_{n=1}^{N} q(z_n | \phi_n) \label{eq:11} \tag{8} \end{equation*}
Note:
To complete the specification, we set each variational factor to be in the same family as the corresponding complete conditional in the model (Equations 4 and 5). If $p(\beta | x, z)$ is a Gaussian, then $\lambda$ are free Gaussian parameters; if $p(z_n | x_n, \beta)$ is discrete over $K$ elements, then $\phi_n$ is a free distribution over $K$ elements.
Coordinate Ascent Variational Inference
Now we want to optimize Equation 7. In coordinate ascent variational inference, we iteratively optimize each variational parameter while holding all the other variational parameters fixed.
Algorithm: (Derivations can be found here)
This algorithm provably goes uphill in the variational objective function, leading to a local optimum. In practice, one uses multiple random restarts to find a good local optimum. Also note that coordinate-ascent variational inference relates closely to the expectation-maximization algorithm of Dempster et al. (1977).
Example
For the Gaussian mixture model, the latent variables are the mixture assignments $z_{1:N}$ and the mixture component means $\mu_{1:K}$. The mean field variational family is $$q(\mu,z)= \prod_{k=1}^{K} q(\mu_k | \lambda_k) \prod_{n=1}^{N} q (z_n | \phi_n)$$
where
Sooo
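Putting these pieces together, here is a hedged sketch of coordinate ascent for this Gaussian mixture. It assumes unit observation variance (as in the model above), a $N(0, \sigma_0^2)$ prior on each component mean, and uniform mixture proportions, matching the mean field family just given (which has no factor for $\theta$); the defaults for $\sigma_0^2$, the iteration count, and the initialization are illustrative.

```python
import numpy as np

def cavi_gmm(x, K, sigma0_sq=10.0, n_iter=100, seed=0):
    """Coordinate ascent variational inference for a unit-variance Gaussian mixture
    with a N(0, sigma0_sq) prior on each component mean and (assumed) uniform
    mixture proportions, matching the mean field family q(mu, z) above.

    Returns the variational parameters: assignment probabilities phi (N x K) for
    each q(z_n | phi_n), and Gaussian parameters (m, s_sq) for each q(mu_k | lambda_k).
    """
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    m = rng.normal(0.0, 1.0, size=K)        # variational means, randomly initialized
    s_sq = np.ones(K)                       # variational variances
    phi = np.full((N, K), 1.0 / K)          # variational assignment probabilities

    for _ in range(n_iter):
        # Update each local factor q(z_n | phi_n), holding q(mu) fixed:
        # phi_{nk} is proportional to exp(x_n E[mu_k] - E[mu_k^2] / 2).
        log_phi = x[:, None] * m[None, :] - 0.5 * (m**2 + s_sq)[None, :]
        log_phi -= log_phi.max(axis=1, keepdims=True)      # stabilize the softmax
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)

        # Update each global factor q(mu_k | lambda_k), holding q(z) fixed.
        s_sq = 1.0 / (1.0 / sigma0_sq + phi.sum(axis=0))
        m = s_sq * (phi * x[:, None]).sum(axis=0)

    return phi, m, s_sq
```

On data like the simulation sketched earlier, the fitted variational means `m` should typically land near the true component means up to a relabeling of the components, and `phi[n]` approximates the posterior assignment probabilities for $x_n$.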
Extensions: The coordinate ascent inference algorithm described above is the simplest variational inference algorithm. Structured variational inference (Saul & Jordan 1996) relaxes the mean field assumption; the variational distribution allows some dependencies among the variables. Nonconjugate variational inference (Knowles & Minka 2011, Wang & Blei 2013) relaxes the assumption that the complete conditionals are in the exponential family. These algorithms generalize previous research on nonconjugate models, which required specialized approximations for the particular model at hand.
Typically, we perform two types of tasks with a model:
We review two useful general techniques: predictive likelihood with sample-reuse (Geisser 1975) and posterior predictive checks (Box 1980, Rubin 1984, Meng 1994, Gelman et al. 1996).
Geisser (1975) proposed predictive sample reuse (PSR), which amounts to using cross-validation to estimate this probability. Let $x_{[n]}$ be the data set with the nth item removed. Suppose our model is $p(\beta, z, x)$ and we use variational inference to approximate the posterior $p(\beta, z | x_{[n]})$ with $q_{[n]}(\beta, z)$. The log predictive likelihood for the nth data point is $$l_n= \log p(x_n | x_{[n]})= \log \int (\int p(x_n | z_n)q(z_n) dz_n) q_{[n]}(\beta) d \beta$$
The full predictive likelihood is $\sum_{n=1}^{N} l_n$. It estimates the held-out log probability of new data with leave-one-out cross-validation. In practice, we can use K-fold cross-validation to estimate the score. We divide our data into K groups; we iteratively hold each group out and approximate the posterior over the global variables $\beta$ with the rest of the data; finally, we compute $l_n$ for each data point, integrating over the approximate posterior estimated from data that do not contain the nth point.
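A hedged sketch of this K-fold procedure; `fit` and `log_pred` are hypothetical placeholders for, respectively, fitting an approximate posterior over the global variables on a training fold and computing $l_n$ for a held-out point under that approximation:

```python
import numpy as np

def kfold_predictive_score(x, fit, log_pred, K_folds=5, seed=0):
    """K-fold estimate of the held-out log predictive likelihood, sum_n l_n.

    `fit` and `log_pred` are hypothetical stand-ins supplied by the modeler:
      fit(x_train)          -> an approximate posterior over the global variables,
                               estimated from the training fold,
      log_pred(x_i, approx) -> log p(x_i | x_train), integrating out the local
                               variables and the approximate posterior.
    """
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    folds = np.array_split(rng.permutation(N), K_folds)

    score = 0.0
    for held_out in folds:
        train = np.setdiff1d(np.arange(N), held_out)
        approx = fit(x[train])                      # posterior from data without the held-out fold
        score += sum(log_pred(x[i], approx) for i in held_out)
    return score
```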
One advantage of PSR is that it does not simply help evaluate the modeling assumptions but also places all approximate inference algorithms on the same playing field: the log predictive probability of the equation above evaluates the predictive distribution no matter how it was approximated. PSR easily lets us compare two different approximations of the same predictive distribution.
Posterior predictive checks (PPCs) (Guttman 1967, Box 1980, Rubin 1984, Meng 1994, Gelman et al. 1996) help answer the important question, “Is my model good enough in the ways that matter?” In a PPC, we locate the observed data in its posterior predictive distribution. If we find that our observed data are not typical—that is, they have low probability under the posterior predictive distribution—then there may be an issue with the model.
We define a discrepancy $T(x)$ to be a function of the data that matters to us; it is a property of the data that we hope our model captures in its predictive distribution (the function $T(x)$ is also called a test statistic). Let $x^{\text{rep}}$ be a random set of new hypothetical future observations, a data set drawn from the posterior predictive distribution. Then, $$\text{PPC} = P(T (x^{\text{rep}}) > T (x) \,|\, x).$$ Note that the only random variable in this expression is $x^{\text{rep}}$. This PPC is the probability that the replicated data are far away (in terms of $T$) from the observations.
An important development from Meng (1994) is to define the discrepancy as a function of both the data and the latent variables, $T(x, \beta)$. Then $$\text{PPC} = P(T (x^{\text{rep}}, \beta) > T (x, \beta) \,|\, x).$$ Here, both the hidden variables $\beta$ and the replicated data $x^{\text{rep}}$ are random. Their joint distribution is a product of the posterior and the data-generating distribution: $$p(\beta, x^{\text{rep}} \,|\, x) = p(\beta \,|\, x)\, p(x^{\text{rep}} \,|\, \beta).$$ Thus, the PPC can be decomposed as the expectation of an indicator: $$\text{PPC} = \int p(\beta | x) \int p(x^{\text{rep}} | \beta)\, \mathbb{1} [T(x^{\text{rep}}, \beta) > T(x, \beta)]\, dx^{\text{rep}}\, d\beta$$
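A minimal Monte Carlo sketch of this quantity: draw $\beta$ from (samples of) the posterior, draw a replicated data set from $p(x^{\text{rep}} | \beta)$, and record how often the replicated discrepancy exceeds the realized one. The `simulate` and `discrepancy` arguments are hypothetical stand-ins supplied by the modeler:

```python
import numpy as np

def ppc(x, beta_samples, simulate, discrepancy, seed=0):
    """Monte Carlo estimate of P(T(x_rep, beta) > T(x, beta) | x).

    beta_samples: draws from the (approximate) posterior p(beta | x).
    simulate(rng, beta, N): hypothetical stand-in that draws a replicated
        data set of size N from p(x_rep | beta).
    discrepancy(x, beta): the chosen discrepancy T(x, beta).
    """
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    exceed = []
    for beta in beta_samples:
        x_rep = simulate(rng, beta, N)     # one replicated data set per posterior draw
        exceed.append(discrepancy(x_rep, beta) > discrepancy(x, beta))
    return np.mean(exceed)
```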
A key advantage of the PPC methodology is that it is adaptable to the particular needs of the practitioner. For example, the discrepancy can be the median of the log probabilities to account for outliers, a set of conditional probabilities if that is how the model will be used, or weighted toward the types of observations that are important to model well in the application at hand. Furthermore, we can look at multiple discrepancies, even reusing computation to do so, to understand trade-offs involved among various models.
Blei DM. 2014. Build, compute, critique, repeat: data analysis with latent variable models. Annu. Rev. Stat. Appl. 1:203–32