🟢 Hidden Beauty Behind Generative AI

Artem Kirsanov
https://www.youtube.com/watch?v=laaBLUxJUMY



Derivation Explanation

To derive the expression on the right from the expression on the left in the image, we are transitioning from the true distribution over all possible states (left-hand side) to an approximation based on samples from the dataset (right-hand side). This is a common technique in machine learning when we cannot access the full distribution $$P(x)$$, but have access to samples.

Left-hand side:

$$ \sum_x P(x) \log P_\theta(x) $$

  • This is an expectation over all possible states $$x$$, where $$P(x)$$ is the true distribution of the data, and $$P_\theta(x)$$ is the model's approximation parameterized by $$\theta$$.
  • We sum over all possible states $$x$$, but in most practical cases, this is computationally intractable since there could be an infinite or very large number of states.

Right-hand side:

$$ \frac{1}{N} \sum_{k=1}^N \log P_\theta(x_k) $$

  • Here, instead of summing over all states $$x$$, we approximate the expectation using a finite set of samples $$\{x_k\}_{k=1}^N$$ drawn from the true distribution $$P(x)$$. This is known as a Monte Carlo (sampling-based) approximation.
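The two sides can be compared numerically. Below is a minimal sketch using a hypothetical 3-state distribution: it computes the exact expectation on the left and the sample-based estimate on the right, which agree closely for large $$N$$ (the specific probabilities are illustrative, not from the source).

```python
import math
import random

# True distribution P(x) and a model approximation P_theta(x) over states {0, 1, 2}.
P = [0.5, 0.3, 0.2]
P_theta = [0.4, 0.4, 0.2]

# Left-hand side: exact expectation, summing over every state.
exact = sum(p * math.log(q) for p, q in zip(P, P_theta))

# Right-hand side: average of log P_theta(x_k) over N samples drawn from P.
random.seed(0)
N = 100_000
samples = random.choices(range(3), weights=P, k=N)
estimate = sum(math.log(P_theta[x]) for x in samples) / N

print(f"exact:    {exact:.4f}")
print(f"estimate: {estimate:.4f}")  # close to the exact value for large N
```

With only three states the exact sum is trivial; the point is that the sample average needs no access to $$P(x)$$ itself, only draws from it.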

Steps in the Derivation:

  1. Expectation as a Sum:

    The left-hand side expression represents the expected value of $$\log P_\theta(x)$$ under the true data distribution $$P(x)$$:

$$ E_{x \sim P(x)}[\log P_\theta(x)] = \sum_x P(x) \log P_\theta(x) $$

But computing this exact expectation requires summing over all possible states $$x$$, which is often infeasible.

  2. Monte Carlo Approximation:

    If we can't sum over all possible states, we can instead approximate this expectation by averaging over a finite set of samples $$\{x_k\}_{k=1}^N$$ drawn from the true distribution $$P(x)$$. This gives the right-hand side expression:

$$ \frac{1}{N} \sum_{k=1}^N \log P_\theta(x_k) $$

Here, $$N$$ is the number of samples, and each $$x_k$$ is a sample from the true distribution. This is known as the empirical expectation.
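One way to see why the empirical expectation works: the sample average is algebraically identical to a weighted sum in which each state is weighted by its sample frequency $$\mathrm{count}(x)/N$$, i.e. the empirical distribution stands in for $$P(x)$$. A small sketch (toy 3-state setup, illustrative values):

```python
import math
import random
from collections import Counter

P_theta = [0.4, 0.4, 0.2]  # hypothetical model probabilities

random.seed(1)
samples = random.choices(range(3), weights=[0.5, 0.3, 0.2], k=1_000)
N = len(samples)

# Right-hand side: (1/N) * sum_k log P_theta(x_k).
empirical_avg = sum(math.log(P_theta[x]) for x in samples) / N

# Equivalent form: sum_x (count(x)/N) * log P_theta(x).
counts = Counter(samples)
weighted_sum = sum((c / N) * math.log(P_theta[x]) for x, c in counts.items())

print(abs(empirical_avg - weighted_sum) < 1e-9)  # prints True
```

As $$N$$ grows, the frequencies $$\mathrm{count}(x)/N$$ converge to $$P(x)$$, which is exactly why the right-hand side converges to the left-hand side.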


Intuition:

  • We can't know the full probability distribution $$P(x)$$ because it might be too complex or high-dimensional, but we do have a dataset that consists of samples from $$P(x)$$.
  • Therefore, instead of summing over all possible states, we estimate the sum using the available samples $$\{x_k\}_{k=1}^N$$.

This is the essence of transitioning from the true expectation to a sample-based approximation, a key idea in many machine learning algorithms, particularly in maximum likelihood estimation and gradient-based learning.
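The connection to maximum likelihood estimation can be sketched concretely. Assuming a Bernoulli model $$P_\theta(x) = \theta^x (1-\theta)^{1-x}$$ (a toy choice, not from the source), maximizing the sample-average log-likelihood over $$\theta$$ recovers the sample mean, shown here by a simple grid search:

```python
import math
import random

# Toy Bernoulli data with true parameter 0.7.
random.seed(2)
data = [1 if random.random() < 0.7 else 0 for _ in range(1000)]

def avg_log_likelihood(theta, xs):
    """(1/N) * sum_k log P_theta(x_k) for the Bernoulli model."""
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs) / len(xs)

# Grid search over theta in (0, 1); the maximizer is the sample mean.
thetas = [i / 1000 for i in range(1, 1000)]
best = max(thetas, key=lambda t: avg_log_likelihood(t, data))
print(best, sum(data) / len(data))  # the two values coincide
```

In practice one maximizes this same sample-average objective with gradient ascent rather than grid search, which is precisely the gradient-based learning mentioned above.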
