Generative Models 001

for EECS 182, UC Berkeley

One view of machine learning is that it is the study of algorithms that make inferences about probability distributions. In this view, the dataset is a sample from some underlying, unknown probability distribution. That is, for some unknown probability distribution $P$ defined on a set $\mathcal{X}$, we have a dataset $D = \{x_1, \ldots, x_n\}$, where each $x_i \sim P$. Given samples from the unknown distribution, we can define different inference tasks that aim to reason about the underlying distribution. Some examples include estimating the density of the distribution, generating new samples from it, and learning useful representations of the data.

Generative Models

We focus on three classes of generative models: variational autoencoders (VAEs), generative adversarial networks (GANs), and diffusion models.

Unsupervised Representation Learning

Consider the setting where you have access to large amounts of unlabelled images from the internet, $D = \{x_1, \ldots, x_n\}$, where each $x_i \in \mathbb{R}^d$. How do we learn meaningful representations of the data?

Contractive Autoencoder

One approach is to perform non-linear dimensionality reduction using neural networks. For example, with a contractive autoencoder, we learn an encoder $f_\theta: \mathbb{R}^d \to \mathbb{R}^k$ and a decoder $g_\phi: \mathbb{R}^k \to \mathbb{R}^d$ such that $g_\phi(f_\theta(x)) \approx x$. The "contractive" part refers to an additional penalty on the Frobenius norm of the encoder's Jacobian, $\|\partial f_\theta(x)/\partial x\|_F^2$, which encourages the learned representation to be insensitive to small perturbations of the input.
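Below is a minimal PyTorch sketch of this setup, pairing a reconstruction loss with a Jacobian penalty; the layer sizes, the penalty weight `lambda_c`, and the use of `torch.autograd.functional.jacobian` are illustrative choices rather than a reference implementation from the course.

```python
import torch
import torch.nn as nn

# Encoder f_theta: R^d -> R^k and decoder g_phi: R^k -> R^d (sizes are illustrative).
d, k = 784, 32
encoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, k))
decoder = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))

def contractive_loss(x, lambda_c=1e-3):
    """Reconstruction error plus a penalty on the encoder Jacobian."""
    z = encoder(x)
    x_hat = decoder(z)
    recon = ((x_hat - x) ** 2).mean()
    # Squared Frobenius norm of dz/dx (computed exactly here; expensive, but clear).
    jac = torch.autograd.functional.jacobian(encoder, x, create_graph=True)
    penalty = (jac ** 2).sum() / x.shape[0]
    return recon + lambda_c * penalty

x = torch.rand(8, d)          # a fake batch of flattened images
loss = contractive_loss(x)
loss.backward()               # gradients reach both encoder and decoder
```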

Denoising Autoencoder

As we've seen in lecture, data augmentations are a powerful tool for improving the generalization of neural networks. An instantiation of this in the autoencoder setting is the denoising autoencoder (Extracting and Composing Robust Features with Denoising Autoencoders, Vincent et al. 2008). In this setting, we corrupt the input $x$ with noise $\epsilon$ and learn an encoder $f_\theta: \mathbb{R}^d \to \mathbb{R}^k$ and a decoder $g_\phi: \mathbb{R}^k \to \mathbb{R}^d$ such that $g_\phi(f_\theta(x + \epsilon)) \approx x$.
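A minimal training-step sketch for this idea is given below, assuming additive Gaussian corruption and a squared-error reconstruction loss (the original paper also considers other corruptions, such as masking noise); the architecture and noise level `sigma` are illustrative.

```python
import torch
import torch.nn as nn

d, k = 784, 32
encoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, k))
decoder = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def denoising_step(x, sigma=0.1):
    """One training step: reconstruct the clean x from a corrupted copy of it."""
    x_noisy = x + sigma * torch.randn_like(x)   # corrupt the input
    x_hat = decoder(encoder(x_noisy))           # encode/decode the *noisy* input
    loss = ((x_hat - x) ** 2).mean()            # target is the *clean* input
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

denoising_step(torch.rand(8, d))
```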

Comparing Distributions

One of the core objectives of generative models is to learn a model $p_\theta$ that is close to $p_{\text{data}}$. This requires defining a notion of closeness between two distributions. Consider the setting where we have two distributions $P(x)$ and $Q(x)$ that share the same support $\mathcal{X}$. Metrics that are commonly used to compare two distributions include the Kullback-Leibler (KL) divergence, $KL(P\,\|\,Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$, and the Jensen-Shannon (JS) divergence.

Note that the KL divergence is generally asymmetric, i.e. $KL(P\,\|\,Q) \neq KL(Q\,\|\,P)$.
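The asymmetry is easy to see numerically; the following small sketch (with two arbitrary discrete distributions chosen purely for illustration) computes both directions of the KL divergence.

```python
import numpy as np

# Two discrete distributions on the same support {0, 1, 2}.
P = np.array([0.90, 0.05, 0.05])
Q = np.array([1 / 3, 1 / 3, 1 / 3])

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)), in nats."""
    return float(np.sum(p * np.log(p / q)))

print(kl(P, Q))   # approximately 0.70
print(kl(Q, P))   # approximately 0.93 -- a different value, so KL is not symmetric
```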

Variational Autoencoder

VAEs fall in the class of likelihood-based models. The goal is to learn a model $p_\theta(x)$ that is close to the true distribution $p_{\text{data}}$. This is done by maximizing the likelihood of the observed data under the model, i.e. $\max_\theta \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]$.

Evidence Lower Bound

Given a latent-variable model, $z \to x$, we have an alternative way to optimize the likelihood of the data, $\max_\theta \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]$. In particular, consider the expression
$$\log p_\theta(x) = \log \int_z p_\theta(x, z)\, dz = \log \int_z p_\theta(x \mid z)\, p(z)\, dz.$$
The above formulation is equivalent to sampling latent variables $z$ from the prior $p(z)$ and then computing the likelihood of the data given the latent variables. However, when the latent variables are continuous, estimating the above expression requires integrating over the entire latent space. As this is intractable in practice, we could instead sample a finite number of latent variables $z_1, \ldots, z_n$ from the prior $p(z)$ and compute the likelihood of the data given those latent variables. This is known as the Monte Carlo approximation, where we approximate the integral with a sum:
$$\log p_\theta(x) \approx \log \frac{1}{n} \sum_{i=1}^{n} p_\theta(x \mid z_i), \qquad z_i \sim p(z).$$
Note that typically, for $x \in \mathbb{R}^d$ and $z \in \mathbb{R}^k$, we have $k < d$. Estimating the likelihood of the higher-dimensional data $x$ conditioned on lower-dimensional latent variables sampled blindly from the prior is unstable to train, due to high-variance gradients. Instead, we consider an alternative formulation in which we additionally find a parametric approximation $q_\phi(z \mid x)$ of the posterior distribution $p(z \mid x)$ of latents $z$ given observed datapoints $x$. This is known as the variational inference framework. In particular, we have the following expression:
$$\log p_\theta(x) = \log \int_z p_\theta(x \mid z)\, p(z)\, dz = \log \int_z \frac{q_\phi(z \mid x)}{q_\phi(z \mid x)}\, p_\theta(x \mid z)\, p(z)\, dz = \log \left[ \mathbb{E}_{z \sim q_\phi(z \mid x)} \left\{ \frac{p(z)}{q_\phi(z \mid x)}\, p_\theta(x \mid z) \right\} \right].$$
Note that the above expression is valid for any approximation of the posterior $q_\phi(z \mid x)$, as long as the support of $q_\phi(z \mid x)$ is a superset of the support of $p(z)$. Optimizing the above expression is still difficult; instead, we optimize a lower bound on it, derived with the help of Jensen's inequality.
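To make the naive Monte Carlo estimate concrete, here is a sketch on a toy linear-Gaussian latent-variable model; the model itself (a fixed random $W$ with Gaussian prior and likelihood) and the sample count are illustrative assumptions chosen so that $p_\theta(x \mid z)$ is easy to evaluate.

```python
import torch

# Toy latent-variable model: z ~ N(0, I_k), x | z ~ N(W z, sigma^2 I_d), with fixed W.
torch.manual_seed(0)
d, k, sigma = 10, 2, 0.5
W = torch.randn(d, k)
prior = torch.distributions.Normal(torch.zeros(k), torch.ones(k))

def log_px_monte_carlo(x, n_samples=1000):
    """log p(x) ~= log (1/n) sum_i p(x | z_i), with z_i drawn from the prior."""
    z = prior.sample((n_samples,))                        # (n, k)
    mean = z @ W.T                                        # (n, d)
    log_px_given_z = torch.distributions.Normal(mean, sigma).log_prob(x).sum(-1)  # (n,)
    # log-mean-exp for numerical stability
    return torch.logsumexp(log_px_given_z, dim=0) - torch.log(torch.tensor(float(n_samples)))

x = W @ prior.sample() + sigma * torch.randn(d)   # one observation drawn from the model
print(log_px_monte_carlo(x))                      # the estimate's variance grows quickly with d
```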

Jensen's inequality for concave functions.

In particular, for any concave function $f$ ($\log$ in this case), we have the following inequality: $f(\mathbb{E}[x]) \geq \mathbb{E}[f(x)]$. Plugging this inequality into the variational expression, we obtain the following lower bound on the true likelihood:
$$\log p_\theta(x) \geq \mathbb{E}_{z \sim q_\phi(z \mid x)} \left[ \log \frac{p(z)}{q_\phi(z \mid x)}\, p_\theta(x \mid z) \right] = \mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathbb{E}_{z \sim q_\phi(z \mid x)} \left[ \log \frac{q_\phi(z \mid x)}{p(z)} \right] = \underbrace{\mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p_\theta(x \mid z)]}_{\text{Reconstruction Likelihood}} - \underbrace{KL(q_\phi(z \mid x)\,\|\,p(z))}_{\text{Matching Posteriors}}.$$
The above bound is called the Evidence Lower BOund (ELBO) and is easier to optimize in practice (Auto-Encoding Variational Bayes, Kingma & Welling 2013). One way to interpret the above expression is that we want to increase the likelihood of decoding the data given the latent variables, while also matching the posterior distribution of the latent variables to a distribution from which we can sample efficiently (e.g. a Gaussian).
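As a concrete instance, for the common modelling choice of a diagonal-Gaussian posterior $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x)))$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$ (the setup used by Kingma & Welling, though not derived in these notes), the KL term has a closed form:
$$KL(q_\phi(z \mid x)\,\|\,p(z)) = \frac{1}{2} \sum_{j=1}^{k} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right).$$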

Sketch of the architecture design for Variational Autoencoders. Notably, the architecture admits a stochastic node.
Reparameterization trick

Gradient-based optimization of the above formulation is still difficult, since taking gradients through a stochastic node (e.g. sampling from a distribution) is not differentiable. To overcome this, we use the reparameterization trick to sample from the posterior distribution $q_\phi(z \mid x)$: we sample from a standard normal distribution and then transform that sample into a sample from the posterior. In particular, consider sampling from a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, where $\mu$ and $\sigma$ are parameters of the distribution. Sampling from the original distribution can be reparameterized as follows:
$$\epsilon \sim \mathcal{N}(0, 1), \qquad z = \mu + \sigma \epsilon.$$
This resolves the trainability issue, since the above expression provides a modified computation graph: the randomness enters only through the input $\epsilon$, while the path from the parameters $\mu$ and $\sigma$ to $z$ is deterministic and differentiable.
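Putting the ELBO and the reparameterization trick together, here is a minimal PyTorch sketch of a VAE loss; the architectures, the Gaussian decoder (squared-error reconstruction term), and the dimensions are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn as nn

# Encoder outputs the mean and log-variance of q_phi(z|x); the prior is N(0, I).
d, k = 784, 32
enc = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 2 * k))
dec = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))

def negative_elbo(x):
    mu, logvar = enc(x).chunk(2, dim=-1)
    eps = torch.randn_like(mu)                          # epsilon ~ N(0, I)
    z = mu + torch.exp(0.5 * logvar) * eps              # reparameterized sample from q_phi(z|x)
    x_hat = dec(z)
    recon = ((x_hat - x) ** 2).sum(dim=-1)              # -log p_theta(x|z) up to constants (Gaussian decoder)
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=-1)   # closed-form KL to N(0, I)
    return (recon + kl).mean()

loss = negative_elbo(torch.rand(8, d))
loss.backward()                                         # gradients flow through z thanks to the trick
```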

Generative Adversarial Networks

As an implicit model, GANs present a different perspective on modelling the data distribution (Generative Adversarial Networks, Goodfellow et al. 2014). In particular, GANs are composed of two networks: a generator $G_\theta$, which maps latent samples $z \sim p(z)$ to synthetic datapoints, and a discriminator $D_\phi$, which predicts whether a given datapoint is real or generated.

Sketch of GANs: adversarial training enables sampling from the data distribution.

Training

The high-level intuition behind training GANs is that we want to train the generator to create samples that are indistinguishable from the real data. At the same time, we want to train the discriminator to be able to distinguish between real and fake data. This presents the following minimax optimization problem:
$$\min_{G_\theta} \max_{D_\phi}\; \underbrace{\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D_\phi(x)]}_{\text{Real Samples}} + \underbrace{\mathbb{E}_{z \sim p(z)}[\log(1 - D_\phi(G_\theta(z)))]}_{\text{Synthetic Samples}}.$$
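A minimal alternating-update sketch of this game is shown below; the architectures, optimizer settings, and the non-saturating generator loss (maximizing $\log D_\phi(G_\theta(z))$ rather than minimizing $\log(1 - D_\phi(G_\theta(z)))$, a common practical substitution) are assumptions for illustration.

```python
import torch
import torch.nn as nn

d, k = 784, 64
G = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))
D = nn.Sequential(nn.Linear(d, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))   # outputs a logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(x_real):
    n = x_real.shape[0]
    # --- Discriminator update: push D(x_real) toward 1 and D(G(z)) toward 0 ---
    z = torch.randn(n, k)
    x_fake = G(z).detach()                       # do not backprop into G here
    d_loss = bce(D(x_real), torch.ones(n, 1)) + bce(D(x_fake), torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # --- Generator update: non-saturating loss, push D(G(z)) toward 1 ---
    z = torch.randn(n, k)
    g_loss = bce(D(G(z)), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

gan_step(torch.rand(8, d))
```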


Sampling

Sampling from a GAN is done by sampling latents from the prior distribution $p(z)$ and then passing them through the generator network. In particular, we have the following generative process:
$$z \sim p(z), \qquad x = G_\theta(z).$$
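For completeness, a standalone sketch of this two-line generative process; the generator below is an untrained stand-in for a trained $G_\theta$, and the standard normal prior is an assumption.

```python
import torch
import torch.nn as nn

d, k = 784, 64
G = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))  # stand-in for a trained generator

with torch.no_grad():
    z = torch.randn(16, k)   # z ~ p(z)
    x = G(z)                 # x = G_theta(z)
```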

Diffusion Models

The algorithms and models we've considered so far directly learn either to estimate the density (VAEs) or to generate samples (GANs). Notably, these algorithms are single-step, meaning that there is a single step of probabilistic inference (variational or implicit).

Diffusion models are iterative samplers that incrementally walk back from a noisy distribution to realistic samples from the data distribution.

Diffusion models are a class of generative models that are multi-step, meaning that they perform multiple steps (say $T$) of inference. In particular, diffusion models are composed of two phases: a forward (diffusion) process, which gradually corrupts the data with noise over $T$ steps until it resembles a simple noise distribution, and a reverse (denoising) process, which learns to undo this corruption step by step, walking back from noise to realistic samples. A sketch of the forward phase is given below.
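Here is a minimal sketch of the forward (noising) phase; the linear $\beta$ schedule, the number of steps $T$, and the closed-form expression for sampling $x_t$ directly from $x_0$ are standard DDPM-style choices used for illustration, not details specified in these notes.

```python
import torch

# Forward process sketch: q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # illustrative linear schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t):
    """Sample x_t given clean data x_0 in closed form (no need to iterate t steps)."""
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over data dims
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    return x_t, eps   # eps is the regression target when training a noise-prediction network

x0 = torch.rand(8, 784)            # a fake batch of flattened images
t = torch.randint(0, T, (8,))      # a random timestep per example
x_t, eps = q_sample(x0, t)
```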

For more details on diffusion models, we refer the reader to these wonderful blog posts: Generative Modeling by Estimating Gradients of the Data Distribution, Song 2021, and What are Diffusion Models?, Weng 2021.

Training

Training the score function for diffusion models.
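To complement the figure, here is a minimal denoising-style training sketch in which a network is trained to predict the added noise, which is equivalent (up to scaling) to learning the score; the architecture, noise schedule, timestep conditioning, and simple MSE objective are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# eps_net(x_t, t) is trained to predict the noise that produced x_t from x_0.
T, d = 1000, 784
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)
eps_net = nn.Sequential(nn.Linear(d + 1, 256), nn.ReLU(), nn.Linear(256, d))
opt = torch.optim.Adam(eps_net.parameters(), lr=1e-3)

def train_step(x0):
    n = x0.shape[0]
    t = torch.randint(0, T, (n,))
    a_bar = alpha_bars[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps   # forward (noising) process
    t_feat = (t.float() / T).unsqueeze(-1)                         # crude timestep conditioning
    loss = ((eps_net(torch.cat([x_t, t_feat], dim=-1)) - eps) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

train_step(torch.rand(8, d))
```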

With a trained estimator of the score function, we can reverse the diffusion process to sample from the data distribution.

Sampling from a diffusion model.
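As one illustration of what reversing the process can look like, below is a sketch of unadjusted Langevin-dynamics sampling with a score network; `score_net` here is a hypothetical, untrained stand-in for a trained estimator, the step size and iteration count are placeholders, and practical samplers (e.g. annealed Langevin dynamics or ancestral sampling in DDPMs) add important refinements.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained score network s_theta(x) ~= grad_x log p(x).
d = 784
score_net = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, d))

@torch.no_grad()
def langevin_sample(n_samples=16, n_steps=200, step_size=1e-3):
    """Iterate x <- x + (step/2) * s_theta(x) + sqrt(step) * noise, starting from pure noise."""
    x = torch.randn(n_samples, d)
    for _ in range(n_steps):
        noise = torch.randn_like(x)
        x = x + 0.5 * step_size * score_net(x) + (step_size ** 0.5) * noise
    return x

samples = langevin_sample()
```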