All the notes below are from Prof. David M. Blei's paper: Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models
A model consists of three types of variables:
Formally:
A mixture model assumes that data are clustered and each data point is drawn from a distribution associated with its assigned cluster. The hidden variables of the model are the cluster assignments and parameters to the per-cluster distributions.
One example is the Gaussian mixture model. The variables of the model are $K$ mixture components $\mu_{1:K}$, each of which is the mean of a Gaussian distribution, and a set of mixture proportions $\theta$, a nonnegative $K$-vector that sums to one. Data arise by first choosing a component assignment $z_n$ (an index from 1 to $K$) from the mixture proportions and then drawing the data point $x_n$ from the corresponding Gaussian, $x_n | z_n \sim N(\mu_{z_n}, 1)$. The hidden variables for this model are the mixture assignments for each data point $z_{1:N}$, the set of mixture component means $\mu_{1:K}$, and the mixture proportions $\theta$. To complete the model, we place distributions (priors) on the mixture components (e.g., Gaussian) and the mixture proportions (e.g., Dirichlet). The fixed hyperparameters $\eta$ are the parameters to these distributions.
The Generative Process for the Gaussian Mixture Model:
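A minimal NumPy sketch of this generative process (the specific values of $K$, $N$, $\sigma_0^2$, and the Dirichlet parameter $\alpha$ are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

K, N = 3, 100          # number of components and data points (assumed)
alpha = np.ones(K)     # Dirichlet hyperparameter (assumed)
sigma0_sq = 10.0       # prior variance of the component means (assumed)

# Global hidden variables: mixture proportions and component means.
theta = rng.dirichlet(alpha)                       # theta ~ Dirichlet(alpha)
mu = rng.normal(0.0, np.sqrt(sigma0_sq), size=K)   # mu_k ~ N(0, sigma0^2)

# Local hidden variables and observations.
z = rng.choice(K, size=N, p=theta)                 # z_n ~ Categorical(theta)
x = rng.normal(mu[z], 1.0)                         # x_n | z_n ~ N(mu_{z_n}, 1)
```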
Parameter Classifications:
The traditional way of representing a model is with the factored joint distribution of its hidden and observed variables. For the Gaussian mixture case, the joint distribution is $$p(\theta, \mu, z, x | \sigma_0^2, \alpha)= p(\theta | \alpha) \prod_{k=1}^{K} p(\mu_k | \sigma_0^2) \prod_{i=1}^{N} \big(p(z_i | \theta)\, p(x_i | z_i, \mu)\big)$$
In particular:
Observe: local variables have terms inside the product over N data points, whereas global variables have terms outside of the product.
We can use the joint distribution above to calculate the posterior for the Gaussian mixture model: $$p(\theta, \mu, z | x, \sigma_0^2, \alpha)= \frac{p(\theta, \mu, z, x | \sigma_0^2, \alpha)}{p(x | \sigma_0^2, \alpha)}$$
We call the denominator of the equation above the marginal probability of the data, also known as the evidence. For many applications, the evidence is hard to compute. Developing approximations to the posterior is the focus of modern Bayesian statistics.
To make predictions, we can use the posterior (over the global variables $\theta$ and $\mu$) to form the posterior predictive distribution of future data: $$p(x_{\text{new}} | x, \sigma_0^2, \alpha)= \int \Big(\sum_{z_{\text{new}}} p(z_{\text{new}}| \theta)\, p(x_{\text{new}} |z_{\text{new}}, \mu)\Big)\, p(\theta, \mu | x, \sigma_{0}^2, \alpha)\, d\theta\, d\mu$$
Note the inner sum marginalizes out the local hidden variables for the new data point, conditional on the global hidden variables. The outer integral marginalizes out the global hidden variables, conditional on the data.
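For the Gaussian mixture, this can be approximated by Monte Carlo once we have draws from (an approximation of) the posterior over the global variables. A hedged sketch, where `theta_samples` and `mu_samples` are assumed to hold $S$ posterior draws obtained elsewhere (e.g., from MCMC or a fitted variational approximation):

```python
import numpy as np
from scipy.stats import norm

def log_predictive(x_new, theta_samples, mu_samples):
    """Monte Carlo estimate of log p(x_new | x) for the unit-variance GMM.

    theta_samples: (S, K) posterior draws of the mixture proportions.
    mu_samples:    (S, K) posterior draws of the component means.
    """
    # Inner sum over z_new: sum_k theta_k * N(x_new | mu_k, 1), for each draw.
    per_draw = np.sum(theta_samples * norm.pdf(x_new, loc=mu_samples, scale=1.0), axis=1)
    # Outer integral over the global variables: average over the S posterior draws.
    return np.log(np.mean(per_draw))
```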
Formally, a graphical model represents the family of distributions that respects the independencies it implies:
In a factor model, there is a set of hidden components, and each data point is associated with a hidden vector of weights, with one weight for each component.
Suppose the data are $p$-dimensional. A linear factor model contains $K$ components $\mu_{1:K}$, each of which is a $p$-vector. We can organize the components into a $K \times p$ matrix $\mu$. For each data point $n$, we draw a $K$-dimensional weight vector $w_n$ and then draw the data point from a distribution parameterized by $w_n^{T} \mu$.
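A minimal generative sketch of such a linear factor model with Gaussian weights and Gaussian noise (the dimensions and the unit noise variance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

N, K, p = 100, 5, 20                 # data points, components, dimensionality (assumed)
mu = rng.normal(size=(K, p))         # components, organized as a K x p matrix
W = rng.normal(size=(N, K))          # w_n: per-data-point weight vectors
X = rng.normal(W @ mu, 1.0)          # x_n ~ N(w_n^T mu, I), unit noise assumed
```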
In a mixed-membership model, each group is drawn from a mixture, in which the mixture proportions are unique to the group and the mixture components are shared across groups.
Mixed-membership models posit a set of global mixture components. Each data group arises when we first choose a set of mixture proportions and then, for each observation in the group, choose a mixture assignment from the per-group proportions and the data point from its corresponding component. The groups share the same set of components, but each exhibits them with a different proportion.
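A hedged sketch of this grouped generative process with one-dimensional Gaussian components (the numbers of groups and observations and the hyperparameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

K, G, M = 4, 10, 50                      # components, groups, observations per group (assumed)
mu = rng.normal(0.0, 3.0, size=K)        # global mixture components, shared by all groups
alpha = np.ones(K)                       # Dirichlet hyperparameter (assumed)

groups = []
for g in range(G):
    theta_g = rng.dirichlet(alpha)             # per-group mixture proportions
    z_g = rng.choice(K, size=M, p=theta_g)     # per-observation mixture assignments
    x_g = rng.normal(mu[z_g], 1.0)             # observations drawn from the shared components
    groups.append(x_g)
```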
In figure b above:
Recall the Netflix challenge, in which each row is a user and each column is a movie ...
A matrix factorization model uses hidden variables to embed both the rows and the columns in a low-dimensional space. Each observed cell is modeled by a distribution whose parameters are a linear combination of the row embedding and the column embedding. That is, cells in a single row share the same row embedding but are governed by different column embeddings.
In figure c above: the overlapping plates show how the observations in the $n$th row share its embedding $w_n$ but use different variables $\gamma_m$ for each column.
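A minimal sketch of a Gaussian matrix factorization of this form (the matrix size, embedding dimension, and unit noise variance are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, K = 50, 40, 5                # rows (users), columns (movies), embedding size (assumed)
W = rng.normal(size=(N, K))        # w_n: row embeddings
Gamma = rng.normal(size=(M, K))    # gamma_m: column embeddings
Y = rng.normal(W @ Gamma.T, 1.0)   # y_{nm} ~ N(w_n^T gamma_m, 1); row n shares w_n across its cells
```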
In an HMM, each observation of the time series is drawn from an unobserved mixture component, and that component is drawn conditional on the previous observation's component. The posterior is similar to that of a mixture model, but one in which the model assumes Markovian structure on the sequence of mixture assignments.
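A minimal sketch of the HMM generative process with Gaussian emissions (the number of states, the transition matrix, and the emission means are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 3, 200                           # hidden states and time steps (assumed)
pi0 = np.ones(K) / K                    # initial state distribution (assumed uniform)
A = rng.dirichlet(np.ones(K), size=K)   # transition matrix: A[j] = p(z_t | z_{t-1} = j)
mu = np.array([-2.0, 0.0, 2.0])         # per-state emission means (assumed)

z = np.empty(T, dtype=int)
x = np.empty(T)
z[0] = rng.choice(K, p=pi0)
x[0] = rng.normal(mu[z[0]], 1.0)
for t in range(1, T):
    z[t] = rng.choice(K, p=A[z[t - 1]])   # component drawn conditional on the previous component
    x[t] = rng.normal(mu[z[t]], 1.0)      # observation drawn from that component
```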
In figure d above:
Three common posterior approximation algorithms:
To describe conditionally conjugate models, let $x = x_{1:N}$ be observations, $\beta$ global latent variables, $z = z_{1:N}$ local latent variables, and $\eta$ fixed parameters. We assume that the joint distribution factorizes as
\begin{equation*} p(\beta, z, x | \eta) = p(\beta | \eta) \prod_{n=1}^{N} p(z_n | \beta)p(x_n | z_n, \beta) \label{eq:4} \tag{1} \end{equation*}The posterior inference problem is to compute the conditional distribution of the latent variables given the data:
\begin{equation*} p(\beta, z | x) = \frac{p(\beta , z, x)}{\int p(\beta, z, x) dz d\beta} \label{eq:5} \tag{2} \end{equation*}Recall that the posterior is essential because (1) it lets us examine the hidden structure that likely generated the data and (2) it is the gateway to prediction about new data. The difficulty in computing the posterior stems from the denominator $p(x)$, the marginal probability of the data. For example, to compute this quantity in the mixture model, we must marginalize out every possible combination of assignments of the data points to mixture components. (There are exponentially many such combinations.) Hence we must approximate the posterior.
To complete our definitions, we assume that each complete conditional is in the exponential family. Recall that many common distributions are in the exponential family: Gaussian, multinomial/categorical, Poisson, gamma, Bernoulli, Dirichlet, beta. Also note that if the distribution of $x$ is in an exponential family, then its density has the form:
\begin{equation*} p(x | \eta) = h(x) \exp \{ \eta^{T}t(x) - \alpha(\eta) \} \label{eq:6} \tag{3} \end{equation*}where
It follows that the complete conditional for the global variable is
\begin{equation*} p(\beta | x, z) = h(\beta) \exp \{ \eta_{g}(x, z)^{T}t(\beta) - \alpha(\eta_{g} (x, z)) \} \label{eq:7} \tag{4} \end{equation*}The complete conditional for the local variable is
\begin{equation*} p(z_n | x_n, \beta) = h(z_n) \exp \{ \eta_{l}(x_n, \beta)^{T}t(z_n) - \alpha(\eta_{l} (x_n, \beta)) \} \label{eq:8} \tag{5} \end{equation*}Side notes: Requiring complete conditionals in the exponential family is less stringent than requiring full conjugacy. In contrast, consider the classical Bayesian modeling setting, in which a latent variable is used as a parameter to the observations. The model is fully conjugate if the conditional distribution of the latent variable is in the same family as its prior, a property that depends on both the prior and the data-generating distribution. For example, if $\beta$ is drawn from a Dirichlet and $x_{1:N}$ are drawn from a multinomial with parameter $\beta$, then the conditional distribution of $\beta$ remains in the Dirichlet family. In complex latent variable models, however, the local variables prevent us from choosing priors to make the global variables conjugate to the observations, which is why we resort to approximate inference. However, conditioned on both the local variables and observations we can often choose prior/likelihood pairs that leave the complete conditional in the same family as the prior. Continuing with mixtures, we can use any common prior/likelihood pair (e.g., gamma/Poisson, Gaussian/Gaussian, Dirichlet/multinomial) to build an appropriate conditionally conjugate mixture model.
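For instance, the Dirichlet/multinomial pair from the mixture above gives a concrete complete conditional. Conditional on the assignments $z_{1:N}$ (writing $\alpha$ for a symmetric Dirichlet hyperparameter and $n_k = \sum_{n=1}^{N} \mathbb{1}[z_n = k]$ for the number of points assigned to component $k$), $$p(\theta | z, x, \alpha) \;\propto\; p(\theta | \alpha) \prod_{n=1}^{N} p(z_n | \theta) \;\propto\; \prod_{k=1}^{K} \theta_k^{\alpha + n_k - 1},$$ that is, $\theta | z \sim \text{Dirichlet}(\alpha + n_1, \dots, \alpha + n_K)$, which remains in the family of the prior.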
Mean field inference is a fast and effective method for obtaining approximate posteriors. The idea is to posit a family of distributions over the latent variables with free parameters (termed variational parameters) and then fit those parameters to find the member of the family that is closest to the posterior; closeness is measured by Kullback–Leibler (KL) divergence.
The variational objective function
We denote the variational family over the latent variables by $q(\beta, z | v)$, where $v$ are the free variational parameters that index the family. The goal of variational inference is to find the optimal variational parameters by solving
\begin{equation*} v^{*} = \text{argmin}_{v} \text{KL} (q(\beta, z | v) \,\|\, p(\beta, z | x)) \label{eq:9} \tag{6} \end{equation*}Unfortunately, computing the KL divergence implicitly requires computing $p(x)$, which is difficult. Recall that for continuous distributions, $$ \text{KL} (p \,\|\, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)}\, dx$$
In this case we can rewrite the inner part of equation (6) as
\begin{align} \int_{ \beta, z} q(\beta ,z | v) \log (\frac{q (\beta, z | v)}{p (\beta ,z | x)}) d\beta dz &= \int_{ \beta, z} q(\beta ,z | v) \log (q (\beta, z | v)) d\beta dz \\ & - \int_{ \beta, z} q(\beta ,z | v) \log (p (\beta, z | x)) d\beta dz \\ & = \mathbb E \log(q(\beta, z | v)) - \int_{ \beta, z} q(\beta ,z | v) \log (\frac{p (\beta, z , x)}{p(x)}) d\beta dz \\ & = \mathbb E \log(q(\beta, z | v))- \mathbb E \log(p(\beta, z, x | \eta)) + \log(p(x)) \end{align}Hence variational inference optimizes (maximizes) the related objective function
\begin{equation*} \mathcal{L} (v) = \mathbb E \log(p(\beta, z, x | \eta)) - \mathbb E \log(q(\beta, z | v)) \label{eq:10} \tag{7} \end{equation*}where all expectations are taken with respect to the variational distribution. Intuitively, the first term values variational distributions that place mass on latent variable configurations that make the data likely; the second term, which is the entropy of the variational distribution, values diffuse variational distributions.
The Mean Field Variational Family
Before optimizing the objective, we must specify the variational family, in which each latent variable is independent and governed by its own variational parameter. Let the variational parameters be $v = \{\lambda, \phi_{1:N}\}$, where $\lambda$ is a parameter to the global variable and $\phi_{1:N}$ are parameters to the local variables. The mean field family is \begin{equation*} q(\beta, z | v) = q(\beta | \lambda) \prod_{n=1}^{N} q(z_n | \phi_n) \label{eq:11} \tag{8} \end{equation*}
Note:
To complete the specification, we set each variational factor to be in the same family as the corresponding complete conditional in the model (Equations 4 and 5). If $p(\beta | x, z)$ is a Gaussian, then $\lambda$ are free Gaussian parameters; if $p(z_n | x_n, \beta)$ is discrete over $K$ elements, then $\phi_n$ is a free distribution over $K$ elements.
Coordinate Ascent Variational Inference
Now we want to optimize Equation 7. In coordinate ascent variational inference, we iteratively optimize each variational parameter while holding all the other variational parameters fixed.
Algorithm: (Derivations can be found here)
This algorithm provably goes uphill in the variational objective function, leading to a local optimum. In practice, one uses multiple random restarts to find a good local optimum. Also note that coordinate-ascent variational inference relates closely to the expectation-maximization algorithm of Dempster et al. (1977).
Example
For the Gaussian mixture model, the latent variables are the mixture assignments $z_{1:N}$ and the mixture component means $\mu_{1:K}$. The mean field variational family is $$q(\mu,z)= \prod_{k=1}^{K} q(\mu_k | \lambda_k) \prod_{n=1}^{N} q (z_n | \phi_n)$$
where
Sooo
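Putting these pieces together, here is a hedged sketch of coordinate ascent for this Gaussian mixture. It assumes unit observation variance (as in the model above), a $N(0, \sigma_0^2)$ prior on each component mean, and uniform mixture proportions, matching the mean field family just given (which has no factor for $\theta$); the defaults for $\sigma_0^2$, the iteration count, and the initialization are illustrative.

```python
import numpy as np

def cavi_gmm(x, K, sigma0_sq=10.0, n_iter=100, seed=0):
    """Coordinate ascent variational inference for a unit-variance Gaussian mixture
    with a N(0, sigma0_sq) prior on each component mean and (assumed) uniform
    mixture proportions, matching the mean field family q(mu, z) above.

    Returns the variational parameters: assignment probabilities phi (N x K) for
    each q(z_n | phi_n), and Gaussian parameters (m, s_sq) for each q(mu_k | lambda_k).
    """
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    m = rng.normal(0.0, 1.0, size=K)        # variational means, randomly initialized
    s_sq = np.ones(K)                       # variational variances
    phi = np.full((N, K), 1.0 / K)          # variational assignment probabilities

    for _ in range(n_iter):
        # Update each local factor q(z_n | phi_n), holding q(mu) fixed:
        # phi_{nk} is proportional to exp(x_n E[mu_k] - E[mu_k^2] / 2).
        log_phi = x[:, None] * m[None, :] - 0.5 * (m**2 + s_sq)[None, :]
        log_phi -= log_phi.max(axis=1, keepdims=True)      # stabilize the softmax
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)

        # Update each global factor q(mu_k | lambda_k), holding q(z) fixed.
        s_sq = 1.0 / (1.0 / sigma0_sq + phi.sum(axis=0))
        m = s_sq * (phi * x[:, None]).sum(axis=0)

    return phi, m, s_sq
```

On data like the simulation sketched earlier, the fitted variational means `m` should typically land near the true component means up to a relabeling of the components, and `phi[n]` approximates the posterior assignment probabilities for $x_n$.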
Extensions: The coordinate ascent inference algorithm described above is the simplest variational inference algorithm. Structured variational inference (Saul & Jordan 1996) relaxes the mean field assumption; the variational distribution allows some dependencies among the variables. Nonconjugate variational inference (Knowles & Minka 2011, Wang & Blei 2013) relaxes the assumption that the complete conditionals are in the exponential family. These algorithms generalize previous research on nonconjugate models, which required specialized approximations for the particular model at hand.
Typically, we perform two types of tasks with a model:
We review two useful general techniques: predictive likelihood with sample-reuse (Geisser 1975) and posterior predictive checks (Box 1980, Rubin 1984, Meng 1994, Gelman et al. 1996).
Geisser (1975) proposed predictive sample reuse (PSR), which amounts to using cross-validation to estimate this probability. Let $x_{[n]}$ be the data set with the nth item removed. Suppose our model is $p(\beta, z, x)$ and we use variational inference to approximate the posterior $p(\beta, z | x_{[n]})$ with $q_{[n]}(\beta, z)$. The log predictive likelihood for the nth data point is $$l_n= \log p(x_n | x_{[n]})= \log \int (\int p(x_n | z_n)q(z_n) dz_n) q_{[n]}(\beta) d \beta$$
The full predictive likelihood is $\sum_{n=1}^{N} l_n$. It estimates the held-out log probability of new data with leave-one-out cross-validation. In practice, we can use K-fold cross-validation to estimate the score. We divide our data into K groups; we iteratively hold each group out and approximate the posterior over the global variables $\beta$ with the rest of the data; finally, we compute $l_n$ for each data point, integrating over the approximate posterior estimated from data that do not contain the nth point.
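A hedged sketch of this K-fold procedure; `fit` and `log_pred` are hypothetical placeholders for, respectively, fitting an approximate posterior over the global variables on a training fold and computing $l_n$ for a held-out point under that approximation:

```python
import numpy as np

def kfold_predictive_score(x, fit, log_pred, K_folds=5, seed=0):
    """K-fold estimate of the held-out log predictive likelihood, sum_n l_n.

    `fit` and `log_pred` are hypothetical stand-ins supplied by the modeler:
      fit(x_train)          -> an approximate posterior over the global variables,
                               estimated from the training fold,
      log_pred(x_i, approx) -> log p(x_i | x_train), integrating out the local
                               variables and the approximate posterior.
    """
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    folds = np.array_split(rng.permutation(N), K_folds)

    score = 0.0
    for held_out in folds:
        train = np.setdiff1d(np.arange(N), held_out)
        approx = fit(x[train])                      # posterior from data without the held-out fold
        score += sum(log_pred(x[i], approx) for i in held_out)
    return score
```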
One advantage of PSR is that it does not simply help evaluate the modeling assumptions but also places all approximate inference algorithms on the same playing field: the log predictive probability of the equation above evaluates the predictive distribution no matter how it was approximated. PSR easily lets us compare two different approximations of the same predictive distribution.
Posterior predictive checks (PPCs) (Guttman 1967, Box 1980, Rubin 1984, Meng 1994, Gelman et al. 1996) help answer the important question, “Is my model good enough in the ways that matter?” In a PPC, we locate the observed data in its posterior predictive distribution. If we find that our observed data are not typical—that is, they have low probability under the posterior predictive distribution—then there may be an issue with the model.
We define a discrepancy $T(x)$ to be a function of the data that matters to us; it is a property of the data that we hope our model captures in its predictive distribution (the function $T(x)$ is also called a test statistic). Let $x^{\text{rep}}$ be a random set of new hypothetical future observations, a data set drawn from the posterior predictive distribution. Then, $$\text{PPC} = P(T (x^{\text{rep}}) > T (x) \,|\, x).$$ Note that the only random variable in this expression is $x^{\text{rep}}$. This PPC is the probability that the replicated data are far away (in terms of $T$) from the observations.
An important development from Meng (1994) is to define the discrepancy as a function of both the data and the latent variables, $T(x, \beta)$. Then $$\text{PPC} = P(T (x^{\text{rep}}, \beta) > T (x, \beta) \,|\, x).$$ Here, both the hidden variables $\beta$ and the replicated data $x^{\text{rep}}$ are random. Their joint distribution is a product of the posterior and the data-generating distribution: $$p(\beta, x^{\text{rep}} \,|\, x) = p(\beta \,|\, x)\, p(x^{\text{rep}} \,|\, \beta).$$ Thus, the PPC can be decomposed as the expectation of an indicator: $$\text{PPC} = \int p(\beta | x) \int p(x^{\text{rep}} | \beta)\, \mathbb{1} [T(x^{\text{rep}}, \beta) > T(x, \beta)]\, dx^{\text{rep}}\, d\beta$$
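A minimal Monte Carlo sketch of this quantity: draw $\beta$ from (samples of) the posterior, draw a replicated data set from $p(x^{\text{rep}} | \beta)$, and record how often the replicated discrepancy exceeds the realized one. The `simulate` and `discrepancy` arguments are hypothetical stand-ins supplied by the modeler:

```python
import numpy as np

def ppc(x, beta_samples, simulate, discrepancy, seed=0):
    """Monte Carlo estimate of P(T(x_rep, beta) > T(x, beta) | x).

    beta_samples: draws from the (approximate) posterior p(beta | x).
    simulate(rng, beta, N): hypothetical stand-in that draws a replicated
        data set of size N from p(x_rep | beta).
    discrepancy(x, beta): the chosen discrepancy T(x, beta).
    """
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    exceed = []
    for beta in beta_samples:
        x_rep = simulate(rng, beta, N)     # one replicated data set per posterior draw
        exceed.append(discrepancy(x_rep, beta) > discrepancy(x, beta))
    return np.mean(exceed)
```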
A key advantage of the PPC methodology is that it is adaptable to the particular needs of the practitioner. For example, the discrepancy can be the median of the log probabilities to account for outliers, a set of conditional probabilities if that is how the model will be used, or weighted toward the types of observations that are important to model well in the application at hand. Furthermore, we can look at multiple discrepancies, even reusing computation to do so, to understand trade-offs involved among various models.
Blei DM. 2014. Build, compute, critique, repeat: data analysis with latent variable models. Annu. Rev. Stat. Appl. 1:203–32