The reward model equation is the negative log-likelihood loss for training the reward model:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_{\phi}(x, y_w) - r_{\phi}(x, y_l)\right)\right]$$

In this equation:
- $r_{\phi}(x, y)$ is the reward model with parameters $\phi$.
- $(x, y_w, y_l)$ is a tuple from the dataset $\mathcal{D}$, where:
  - $x$ is the input
  - $y_w$ is the preferred output
  - $y_l$ is the less preferred output
- $\sigma(z) = \frac{1}{1+e^{-z}}$ is the logistic function.
- $r_{\phi}(x, y_w) - r_{\phi}(x, y_l)$ computes the difference in rewards between the two outputs.
- $\sigma(r_{\phi}(x, y_w) - r_{\phi}(x, y_l))$ represents the probability that $y_w$ is preferred over $y_l$.
- The log of this probability is taken, and its negative expectation over the dataset forms the loss.
This loss encourages the model to assign higher rewards to preferred outputs.
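To make the loss concrete, here is a minimal PyTorch sketch of the batched computation. The function name `reward_model_loss` and the example reward values are illustrative assumptions, not from the original text; the sketch assumes the rewards $r_{\phi}(x, y_w)$ and $r_{\phi}(x, y_l)$ have already been computed for a batch of tuples from $\mathcal{D}$.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(rewards_chosen: torch.Tensor,
                      rewards_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that y_w is preferred over y_l.

    rewards_chosen:   r_phi(x, y_w) for a batch, shape (batch_size,)
    rewards_rejected: r_phi(x, y_l) for the same batch, shape (batch_size,)
    """
    # sigma(r_w - r_l) is the modeled probability that y_w beats y_l;
    # logsigmoid computes log(sigma(.)) in a numerically stable way,
    # and the batch mean approximates the expectation over D.
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Hypothetical rewards for three (x, y_w, y_l) tuples
r_w = torch.tensor([1.2, 0.3, 2.0])  # r_phi(x, y_w)
r_l = torch.tensor([0.4, 0.5, 1.1])  # r_phi(x, y_l)
print(reward_model_loss(r_w, r_l).item())
```

Using `logsigmoid` rather than `torch.log(torch.sigmoid(...))` avoids numerical underflow when the reward difference is a large negative number, which is why the loss is usually written this way in practice.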