Representation Learning with Latent Variable Models
In this post, I discuss what makes the posterior distribution of a directed latent variable model a useful representation, and I raise questions that deserve careful study in the case of undirected latent variable models.
- Uninformative latent variables in LVMs
- Information theoretical effects of maximizing ELBO
- Open questions: What about undirected LVMs?
As Ferenc Huszár discussed in his blog post, training a latent variable model (LVM) via maximum likelihood estimation (MLE) does not necessarily lead to a “useful” representation. In fact, in the infinite model-family limit, learning a “useful” representation of your data and achieving a high likelihood value (or generating high-quality samples) are two orthogonal goals. For VAEs, what makes their latent representation useful is their auto-encoder-based model structure as well as the ELBO maximization objective (Tschannen et al., 2018). However, what makes a useful representation in the case of undirected LVMs is still mysterious to me.
Uninformative latent variables in LVMs
In the literature, it is reported that in the context of VAEs, a powerful decoder like PixelCNN can generate high-quality samples, yet the mutual information between the generated sample and the conditioned latent variable is very low. This is fatal to representation learning, since the representation then contains almost no information about the original data. This is intuitive if you consider an LVM $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$: when your decoder remains a powerful density estimator (e.g., an EBM) on $x$ even after setting the weights connecting it to the latent variable to 0, your LVM implicitly becomes $p_\theta(x, z) = p_\theta(x)\, p(z)$, which makes $x$ and $z$ independent and their mutual information zero.
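This independence argument is easy to check numerically. Below is a minimal sketch (NumPy, with toy discrete distributions of my own choosing, not from the post) in which the “decoder” assigns the same distribution over $x$ for every $z$; the mutual information of the resulting joint is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)

pz = rng.dirichlet(np.ones(3))        # toy prior p(z) over 3 states
px = rng.dirichlet(np.ones(5))        # a single "powerful" density over x

# "Decoder" that ignores z entirely: p(x|z) = p(x) for every z
px_given_z = np.tile(px, (3, 1))

joint = pz[:, None] * px_given_z      # p(x, z) = p(z) p(x|z)
marg_x = joint.sum(axis=0)            # marginal p(x)

# I(x; z) = sum_{z,x} p(x,z) log[ p(x,z) / (p(z) p(x)) ]
mi = np.sum(joint * np.log(joint / (pz[:, None] * marg_x[None, :])))
assert np.isclose(mi, 0.0)            # independence => zero mutual information
```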
I also came up with another way to approximately explain the above effect. Take the latent variable model $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$ as a mixture distribution, and use its marginal $p_\theta(x) = \mathbb{E}_{p(z)}[p_\theta(x \mid z)]$ to model the data distribution $p_d(x)$. Consider the learning process as minimizing the forward KL divergence $\mathrm{KL}(p_\theta(x) \,\|\, p_d(x))$ (although MLE minimizes the reverse direction $\mathrm{KL}(p_d(x) \,\|\, p_\theta(x))$, conceptually we can think of them as having a similar effect on the model). Decomposing the averaged per-component divergence gives

$$\mathbb{E}_{p(z)}\left[\mathrm{KL}\big(p_\theta(x \mid z) \,\|\, p_d(x)\big)\right] = I(x; z) + \mathrm{KL}\big(p_\theta(x) \,\|\, p_d(x)\big),$$

which yields an upper bound on the mutual information between $x$ and $z$ under the model:

$$I(x; z) \le \mathbb{E}_{p(z)}\left[\mathrm{KL}\big(p_\theta(x \mid z) \,\|\, p_d(x)\big)\right].$$

Thus when your decoder by itself can model $p_d(x)$ well in the sense of this averaged forward KL divergence, little information about $x$ is contained in $z$.
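The identity behind this bound can be verified on a toy discrete model. The sketch below (NumPy; the distributions are arbitrary Dirichlet draws I chose for illustration) checks that the averaged per-component KL equals the mutual information plus the marginal KL, for an arbitrary reference distribution $q(x)$ standing in for $p_d(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete LVM: prior p(z) over 3 states, decoder p(x|z) over 5 states
pz = rng.dirichlet(np.ones(3))
px_given_z = rng.dirichlet(np.ones(5), size=3)   # rows: p(x|z=k)

px = pz @ px_given_z                             # marginal p(x) = E_{p(z)} p(x|z)

def kl(p, q):
    """KL divergence between two discrete distributions (all entries > 0)."""
    return np.sum(p * np.log(p / q))

# Mutual information under the model: I(x;z) = E_{p(z)} KL(p(x|z) || p(x))
mi = np.sum(pz * [kl(px_given_z[k], px) for k in range(3)])

# Arbitrary reference q(x) playing the role of the data distribution p_d(x)
qx = rng.dirichlet(np.ones(5))
avg_kl = np.sum(pz * [kl(px_given_z[k], qx) for k in range(3)])

# Identity: E_{p(z)} KL(p(x|z)||q(x)) = I(x;z) + KL(p(x)||q(x)) >= I(x;z)
assert np.isclose(avg_kl, mi + kl(px, qx))
assert avg_kl >= mi
```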
Information theoretical effects of maximizing ELBO
When training a VAE, since the marginal distribution $p_\theta(x)$ is intractable for direct MLE, we turn to maximizing the evidence lower bound (ELBO):

$$\log p_\theta(x) \ge \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]}_{A} - \underbrace{\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{B}.$$

Term A is the reconstruction term, which can increase the mutual information between $x$ and $z$, and term B serves as a regularizer that encourages a disentangled posterior.
Let’s further examine term B. First, define the aggregated posterior as

$$q_\phi(z) = \mathbb{E}_{p_d(x)}\left[q_\phi(z \mid x)\right];$$

then we have

$$\mathbb{E}_{p_d(x)}\left[\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)\right] = I_q(x; z) + \mathrm{KL}\big(q_\phi(z) \,\|\, p(z)\big),$$

where $I_q(x; z)$ is the mutual information under the joint $p_d(x)\, q_\phi(z \mid x)$. In this way, the regularizer over the posterior distribution penalizes the model for high mutual information between $x$ and $z$. Note that without it, the model would try its best to encode everything about $x$ into $z$ to achieve low reconstruction error, which would not lead to a useful representation.
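The same kind of finite-dimensional check works for this decomposition of term B. In the sketch below (NumPy; the toy encoder and data distribution are arbitrary choices of mine), the data-averaged KL to the prior splits exactly into $I_q(x;z)$ plus the aggregated-posterior KL.

```python
import numpy as np

rng = np.random.default_rng(1)

pd_x = rng.dirichlet(np.ones(4))                 # toy data distribution p_d(x)
qz_given_x = rng.dirichlet(np.ones(3), size=4)   # toy encoder rows q(z|x=i)
pz = np.full(3, 1 / 3)                           # uniform prior p(z)

qz = pd_x @ qz_given_x                           # aggregated posterior q(z)

def kl(p, q):
    """KL divergence between two discrete distributions (all entries > 0)."""
    return np.sum(p * np.log(p / q))

# Term B averaged over the data: E_{p_d(x)} KL(q(z|x) || p(z))
avg_B = np.sum(pd_x * [kl(qz_given_x[i], pz) for i in range(4)])

# I_q(x;z) under the joint p_d(x) q(z|x): E_{p_d(x)} KL(q(z|x) || q(z))
mi = np.sum(pd_x * [kl(qz_given_x[i], qz) for i in range(4)])

# Decomposition: E_{p_d} KL(q(z|x)||p(z)) = I_q(x;z) + KL(q(z)||p(z))
assert np.isclose(avg_B, mi + kl(qz, pz))
```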
In (Alemi et al., 2018), the authors analyze the ELBO with rate-distortion theory, where distortion ($D$) and rate ($R$) are defined as:

$$D = -\mathbb{E}_{p_d(x)} \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right], \qquad R = \mathbb{E}_{p_d(x)}\left[\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)\right].$$

Note that when we train a VAE by maximizing the ELBO, we are minimizing the sum of $D$ and $R$. As shown in the paper, when your decoder is powerful enough, the term $R$ (which is equal to $I_q(x; z) + \mathrm{KL}(q_\phi(z) \,\|\, p(z))$) can be pushed to 0; this leads to $I_q(x; z) = 0$, because both the KL divergence and the mutual information are non-negative. Adding a weighting parameter $\beta$ between $D$ and $R$ leads to a tradeoff:

$$\min_{\theta, \phi}\; D + \beta R,$$

and this reveals the $\beta$-VAE objective (Higgins et al., 2017).
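To make the $D$/$R$ bookkeeping concrete, here is a toy discrete computation (NumPy; all distributions are arbitrary Dirichlet draws for illustration, not from the paper) of distortion, rate, and the $\beta$-weighted objective, together with a check that $R$ upper-bounds $I_q(x;z)$.

```python
import numpy as np

rng = np.random.default_rng(2)

pd_x = rng.dirichlet(np.ones(4))                 # toy data distribution p_d(x)
qz_given_x = rng.dirichlet(np.ones(3), size=4)   # toy encoder rows q(z|x=i)
px_given_z = rng.dirichlet(np.ones(4), size=3)   # toy decoder rows p(x|z=k)
pz = np.full(3, 1 / 3)                           # uniform prior p(z)

def kl(p, q):
    """KL divergence between two discrete distributions (all entries > 0)."""
    return np.sum(p * np.log(p / q))

# Distortion: D = -E_{p_d(x)} E_{q(z|x)} log p(x|z)
D = -np.sum(pd_x[:, None] * qz_given_x * np.log(px_given_z.T))

# Rate: R = E_{p_d(x)} KL(q(z|x) || p(z))
R = np.sum(pd_x * [kl(qz_given_x[i], pz) for i in range(4)])

# Negative ELBO is D + R; the beta-VAE objective generalizes it to D + beta*R
beta = 4.0
beta_vae_loss = D + beta * R

# R upper-bounds the representational mutual information I_q(x;z)
qz = pd_x @ qz_given_x                           # aggregated posterior q(z)
mi = np.sum(pd_x * [kl(qz_given_x[i], qz) for i in range(4)])
assert R >= mi - 1e-12                           # pushing R to 0 forces I_q to 0
```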
Open questions: What about undirected LVMs?
The original hope of generative representation learning is that if we can recreate all the data that we have seen, then we may implicitly learn a representation that can be used to answer any question about the data. And in the sense of generative modeling, both directed and undirected latent variable models can perform well.
However, undirected LVMs (i.e., energy-based latent variable models, EBLVMs) are not trained by maximizing the ELBO, so the above theory cannot analyze their representational properties. It is interesting to study the following questions:
- What makes the learned representation in EBLVMs useful: the model or the learning algorithm? In the above discussion on directed LVMs, we can say it is the auto-encoder-based structure, together with the ELBO maximization objective, that makes the representation useful. Does a bipartite graphical structure (like that of RBMs) play a similar role to the auto-encoder structure in VAEs?
- In (Wu et al., 2021), (Liao et al., 2022), and (Lee et al., 2023), the latent variables of the EBLVMs can be marginalized out and the marginal energy functions are analytically available. This enables us to train these models as common EBMs with no latent variables. When the training procedure has no explicit connection with $z$, the representational usefulness of $z$ depends heavily on the extra constraint provided by the model structure. It is reported in (Wu et al., 2021) that posterior collapse may happen in their conjugate energy-based model and the mutual information between $x$ and the encoded $z$ is low, while in (Liao et al., 2022), when the GRBM is trained with the marginal energy, it tends to map $x$ to an almost deterministic latent code, which preserves high mutual information. So different designs of the joint energy function lead to different levels of coupling between $x$ and $z$; which part of an EBLVM is essential for that kind of coupling? Does this problem have any connection with latent variable identifiability (Wang et al., 2021)?
- High mutual information between $x$ and $z$ does not necessarily lead to a good representation. Can we establish a rate-distortion theory in the context of undirected LVMs similar to that of (Alemi et al., 2018) (Tschannen et al., 2018)?
- Tschannen, M., Bachem, O., & Lucic, M. (2018). Recent advances in autoencoder-based representation learning. ArXiv Preprint ArXiv:1812.05069.
- Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R. A., & Murphy, K. (2018). Fixing a broken ELBO. International Conference on Machine Learning, 159–168.
- Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., & Lerchner, A. (2017). beta-VAE: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations.
- Wu, H., Esmaeili, B., Wick, M., Tristan, J.-B., & Van De Meent, J.-W. (2021). Conjugate Energy-Based Models. International Conference on Machine Learning, 11228–11239.
- Liao, R., Kornblith, S., Ren, M., Fleet, D. J., & Hinton, G. (2022). Gaussian-Bernoulli RBMs Without Tears. ArXiv Preprint ArXiv:2210.10318.
- Lee, H., Jeong, J., Park, S., & Shin, J. (2023). Guiding Energy-based Models via Contrastive Latent Variables. ArXiv Preprint ArXiv:2303.03023.
- Wang, Y., Blei, D., & Cunningham, J. P. (2021). Posterior collapse and latent variable non-identifiability. Advances in Neural Information Processing Systems, 34, 5443–5455.