Intro to Generative Adversarial Networks

4 minute read


This is the introduction of Generative Adversarial Networks (GANs) method with some key concepts of machine learning, including KL divergence and MLE.

Kullback-Leibler Divergence

In statistics, the Kullback-Leibler (KL) divergence is used to measure the difference of two different distributions, which is defined as:
\(D_{KL}(P||Q)=E_{X\sim P}[log\frac{P(X)}{Q(X)}]\)

where $P(X)$ and $Q(X)$ are two separate distributions of random variable X. If the distributions are of discrete random variable with sample space $S$, the KL divergence is defined as
\(D_{KL}(P||Q)=\sum_{x\in S}P(x)log\frac{P(x)}{Q(x)}\)

In addition, if the distributions are of continuous random variable, the KL divergence is defined as

In the field of information theory, the KL divergence from $Q$ to $P$, denoted as $D{KL}(P||Q)$, is the extra amount of information needed to convey the signal of distribution $P$ using the codes in distribution $Q$. If the value of KL divergence is close to $0$, it means that these two distributions are almost the same. In this case, we set $P$ as the empirical distribution of data we collect and we also set $Q$ as the estimated distribution of the data, denoted them as $\hat{p}{data}$ and $p_{model}$ separately. So, we have:
\(D_{KL}(\hat{p}_{data}||p_{model})=E_{X\sim \hat{p}_{data}}[log\frac{\hat{p}_{data}(X)}{p_{model}(X)}]=E_{X\sim \hat{p}_{data}}[log(\hat{p}_{data}(X))-log(p_{model}(X))]\)

KL Divergence and MLE

The KL divergence above measures how much information we lost after fitting a model with data. In model building, we usually minimize such loss information corresponding to the parameter of the distribution which is denoted as $\theta$. In other word, minimizing the KL divergence is exactly the same as minimizing:
\(-E_{X\sim \hat{p}_{data}}[log(p_{model}(X;\theta))]\)

By rewriting the formula, we have:
\(\hat{\theta}_{MLE} = argmax ( E_{X\sim \hat{p}_{data}}[log(p_{model}(X;\theta))])\)

Considering $\hat{p}_{data}$ is defined by the training data, we can acquire the maximum likelihood estimator by maximizing the log-likelihood function. In this case, the optimal $\theta$ is the same when maximizing the likelihood or minimizing the KL divergence.

Explicit Density Models

In maximum likelihood method, the density function may be computationally tractable if the models construct an explicit density $p_{model}(X;\theta)$. For tractable density, we usually utilize methods like Fully Visible Belief Networks (FVBNs) and nonlinear Independent Component Analysis (ICA). If the model is intractable, some approximation methods are considered for maximizing the likelihood function, for example, the Variational Autoencoder (VAE) and Markov chain approximations are preferred. However, these methods have their own disadvantages.

  • For FVBNs, we can not ignore the complexity of computation since it cannot compute in parallel.

  • For nonlinear ICA, the main drawback is that it is not easy to find a invertible generator function $g$ with the latent variable $z$ that has the same dimensionality as $x$.

  • For VAE, the consistency issue occurs when we accidentally choose some weak posterior or prior distribution.

  • For Markov chain approximations, since there is no clear way to test the convergence of the results, the algorithms might run for a long time.

The disadvantages above are common in maximum likelihood method of explicit density models. Luckily, the Generative Adversarial Networks (GANs) can avoid such drawbacks.

Generative Adversarial Networks

For both explicit and implicit density, the Generative Adversarial Networks can perform better. The basic idea of GANs is like training two player against each other in a game. One of the player is called generator who generate sample from the same intended distribution and the another one is called discriminator who tests whether the new sample is from the true distribution or not.
Generally, GANs are a kind of structured probabilistic model with latent variable $z$ and observed variable $x$ corresponding with characteristic parameters $\theta$. Intuitively, we want the generator can create ‘true’ sample to cheat the discriminator meanwhile the discriminator can discriminate the ‘fake’ trick by the generator. If the discriminator will share the updated information of such identification, the generative model can produce the new sample we want.
This method can be separated into two part. Firstly, we train the random sample $X$ from the training set to build a discriminator function $D$. Secondly, the generator can create fake sample(s) $G$ from noise with the prior distribution with latent variable. At the same time, the discriminator receives $G$ as new inputs trying to make $D(G)$ approach 0 while the generative strives to make it approach 1. When the competing method achieve the Nash equilibrium, we output the generative sample.

Drawbacks of GANs

Anyway, the GANs is not a perfect algorithm. The disadvantages of GANs are:

  • Nash equilibrium: Training GANs needs to find Nash equilibrium which is more difficult than optimizing the object function.
  • Non-convergence: In practice, the GANs cannot converge or will converge in a slow rate without eventually reaching an equilibrium. e.g. Mode collapse.
  • Evaluation: It is difficult to apply clearly justified way to quantitatively evaluate a GANs model.