A proof showing that collaborative filtering is equivalent to a special type of 3-layer neural network for multi-class classification

In his Coursera course Machine Learning, Andrew Ng introduces a collaborative filtering algorithm whose optimization objective is

$$ \min_{x^{(1)}, \dots, x^{(n_m)},\ \theta^{(1)}, \dots, \theta^{(n_\mu)}} \frac{1}{2} \sum_{(i, j): r(i, j) = 1} ((\theta^{(j)})^T x^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2} \sum_{i = 1}^{n_m} \sum_{k = 1}^n (x^{(i)}_k)^2 + \frac{\lambda}{2} \sum_{j = 1}^{n_\mu} \sum_{k = 1}^n (\theta^{(j)}_k)^2 $$

and the vectorized expression is

$$ \min_{X, \Theta} \frac{1}{2} ||{R \circ (X \Theta^T - Y)}||^2 + \frac{\lambda}{2} ||X||^2 + \frac{\lambda}{2} ||\Theta||^2 $$

where $\circ$ denotes the element-wise product, and $||A||^2$ denotes the sum of the squares of all elements of $A$ (the squared Frobenius norm).
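For concreteness, here is a minimal NumPy sketch of this vectorized objective; the array names (`X`, `Theta`, `Y`, `R`) follow the notation above, but the toy dimensions and random data are assumptions made purely for illustration.

```python
import numpy as np

# Toy dimensions (assumed for illustration): n_m items, n_mu users, n features.
n_m, n_mu, n, lam = 5, 4, 3, 0.1

rng = np.random.default_rng(0)
X = rng.normal(size=(n_m, n))          # rows are the item feature vectors x^(i)
Theta = rng.normal(size=(n_mu, n))     # rows are the user parameter vectors theta^(j)
Y = rng.normal(size=(n_m, n_mu))       # ratings matrix
R = rng.integers(0, 2, size=(n_m, n_mu)).astype(float)  # r(i, j) = 1 where rated

def cofi_cost(X, Theta, Y, R, lam):
    """Vectorized collaborative filtering objective."""
    E = R * (X @ Theta.T - Y)          # element-wise product masks unrated entries
    return 0.5 * np.sum(E**2) + lam / 2 * np.sum(X**2) + lam / 2 * np.sum(Theta**2)

print(cofi_cost(X, Theta, Y, R, lam))
```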

For a neural network, on the other hand, the optimization objective is

$$ \min_{\Theta} \frac{1}{m} \sum_{i = 1}^m \sum_{k = 1}^K cost(h_\theta(x^{(i)})_k, y^{(i)}_k) + \frac{\lambda}{2 m} \sum_{l = 1}^{L - 1} \sum_{i = 1}^{S_l} \sum_{j = 1}^{S_{l + 1}} (\Theta^{(l)}_{j i})^2 $$

If we let it be a 3-layer neural network for multi-class classification aka $L = 3$, with squared error $ cost(y_1, y_2) = \frac{1}{2} (y_1 - y_2)^2 $ and identity activation $ g(x) = x $ so $ a^{(l)} = g(z^{(l)}) = g(\Theta^{(l - 1)} a^{(l - 1)}) = \Theta^{(l - 1)} a^{(l - 1)} $, we have

$$ h_\theta(x) = a^{(3)} = g(\Theta^{(2)} g(\Theta^{(1)} x)) = \Theta^{(2)} \Theta^{(1)} x $$
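As a quick sanity check, the following sketch (with assumed layer sizes and no bias units) verifies numerically that a forward pass with the identity activation collapses to the matrix product $\Theta^{(2)} \Theta^{(1)} x$.

```python
import numpy as np

rng = np.random.default_rng(1)
S1, S2, S3 = 5, 3, 4                 # assumed layer sizes, no bias units
Theta1 = rng.normal(size=(S2, S1))   # Theta^(1): layer 1 -> layer 2 weights
Theta2 = rng.normal(size=(S3, S2))   # Theta^(2): layer 2 -> layer 3 weights
x = rng.normal(size=S1)

g = lambda z: z                      # identity activation
a2 = g(Theta1 @ x)                   # hidden layer activations
a3 = g(Theta2 @ a2)                  # output h_theta(x)

# The forward pass is just the product of the weight matrices applied to x.
assert np.allclose(a3, Theta2 @ Theta1 @ x)
```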

The optimization objective then becomes

$$ \min_{\Theta} \frac{1}{2 m} \sum_{i = 1}^m \sum_{k = 1}^K ((\Theta^{(2)} \Theta^{(1)} x^{(i)})_k - y^{(i)}_k)^2 + \frac{\lambda}{2 m} \sum_{i = 1}^{S_1} \sum_{j = 1}^{S_2} (\Theta^{(1)}_{j i})^2 + \frac{\lambda}{2 m} \sum_{i = 1}^{S_2} \sum_{j = 1}^{S_3} (\Theta^{(2)}_{j i})^2 $$

and the vectorized expression is

$$ \min_{\Theta} \frac{1}{2 m} ||X (\Theta^{(1)})^T (\Theta^{(2)})^T - Y||^2 + \frac{\lambda}{2 m} ||\Theta^{(1)}||^2 + \frac{\lambda}{2 m} ||\Theta^{(2)}||^2 $$

We can drop the constant $m$s, which do not affect the minimizer, and the objective becomes

$$ \min_{\Theta} \frac{1}{2} ||X (\Theta^{(1)})^T (\Theta^{(2)})^T - Y||^2 + \frac{\lambda}{2} ||\Theta^{(1)}||^2 + \frac{\lambda}{2} ||\Theta^{(2)}||^2 $$
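The following sketch (again with assumed toy sizes and random data) checks that this Frobenius-norm form agrees with writing out the sum over examples $i$ and output units $k$ explicitly, both with the constant $m$ dropped.

```python
import numpy as np

rng = np.random.default_rng(2)
m, S1, S2, S3, lam = 6, 5, 3, 4, 0.1     # assumed toy sizes
X = rng.normal(size=(m, S1))             # rows are the training inputs x^(i)
Y = rng.normal(size=(m, S3))             # rows are the targets y^(i)
Theta1 = rng.normal(size=(S2, S1))
Theta2 = rng.normal(size=(S3, S2))

reg = lam / 2 * np.sum(Theta1**2) + lam / 2 * np.sum(Theta2**2)

# Frobenius-norm form (constant m dropped).
vectorized = 0.5 * np.sum((X @ Theta1.T @ Theta2.T - Y)**2) + reg

# Explicit sum over examples i and output units k (constant m dropped).
elementwise = 0.5 * sum(
    ((Theta2 @ Theta1 @ X[i])[k] - Y[i, k])**2
    for i in range(m) for k in range(S3)
) + reg

assert np.isclose(vectorized, elementwise)
```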

This is now very similar to the collaborative filtering optimization objective. Additionally, let $S_1 = n_m$, $S_2 = n$, and $S_3 = K = n_\mu$, with $n < n_m$ aka $S_2 < S_1$ and $n < n_\mu$ aka $S_2 < S_3$, and without prepending the constant $1$s (bias units) to the vectors as in traditional networks. Let the input be $X = I_{n_m}$, and define the cost against an unknown rating to be $0$, aka $cost(x, ?) = cost(?, x) = 0$. Then it becomes a 3-layer neural network that answers the question of how much the users prefer a specific item, with the input being a standard basis vector $e_i$ denoting which item it is. The optimization objective is

$$ \min_{\Theta} \frac{1}{2} ||R \circ ((\Theta^{(1)})^T (\Theta^{(2)})^T - Y)||^2 + \frac{\lambda}{2} ||\Theta^{(1)}||^2 + \frac{\lambda}{2} ||\Theta^{(2)}||^2 $$

For the last step, we replace $\Theta^{(1)}$ with $X^T$ and $\Theta^{(2)}$ with $\Theta$, the notations of the parameters in collaborative filtering, and this optimization objective becomes exactly the same as that of collaborative filtering.
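Putting everything together, a small numerical check (toy sizes and random data are assumptions for illustration) confirms the equivalence: with the input $I_{n_m}$, $\Theta^{(1)} = X^T$, and $\Theta^{(2)} = \Theta$, the masked neural network objective evaluates to exactly the collaborative filtering objective.

```python
import numpy as np

rng = np.random.default_rng(3)
n_m, n_mu, n, lam = 5, 4, 3, 0.1                 # assumed toy sizes
X = rng.normal(size=(n_m, n))                    # collaborative filtering item features
Theta = rng.normal(size=(n_mu, n))               # collaborative filtering user parameters
Y = rng.normal(size=(n_m, n_mu))
R = rng.integers(0, 2, size=(n_m, n_mu)).astype(float)

# Collaborative filtering objective.
cofi = 0.5 * np.sum((R * (X @ Theta.T - Y))**2) \
    + lam / 2 * np.sum(X**2) + lam / 2 * np.sum(Theta**2)

# The 3-layer network: input I_{n_m}, Theta^(1) = X^T, Theta^(2) = Theta,
# with unknown ratings masked by R (cost against an unknown is 0).
I = np.eye(n_m)
Theta1, Theta2 = X.T, Theta
nn = 0.5 * np.sum((R * (I @ Theta1.T @ Theta2.T - Y))**2) \
    + lam / 2 * np.sum(Theta1**2) + lam / 2 * np.sum(Theta2**2)

assert np.isclose(cofi, nn)
```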

Symmetrically, we can also swap $n_m$ and $n_\mu$ and the layer dimensions become $S_1 = n_\mu, S_2 = n, S_3 = K = n_m$.
