The reward model equation is the negative log-likelihood loss for training the reward model:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_{\phi}(x, y_w) - r_{\phi}(x, y_l)\right)\right]$$

In this equation:
- $r_{\phi}(x, y)$ is the reward model with parameters $\phi$.
- $(x, y_w, y_l)$ is a tuple from the dataset $\mathcal{D}$, where:
  - $x$ is the input
  - $y_w$ is the preferred output
  - $y_l$ is the less preferred output
- $\sigma(z) = \frac{1}{1+e^{-z}}$ is the logistic function.
- $r_{\phi}(x, y_w) - r_{\phi}(x, y_l)$ computes the difference in rewards between the two outputs.
- $\sigma(r_{\phi}(x, y_w) - r_{\phi}(x, y_l))$ represents the probability that $y_w$ is preferred over $y_l$.
- The log of this probability is taken, and its negative expectation over the dataset forms the loss.
This loss encourages the model to assign higher rewards to preferred outputs.
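To make the loss concrete, here is a minimal PyTorch sketch of the batched computation. The function name `reward_model_loss` and the example reward values are illustrative assumptions, not from the original text; the sketch assumes the rewards $r_{\phi}(x, y_w)$ and $r_{\phi}(x, y_l)$ have already been computed for a batch of tuples from $\mathcal{D}$.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(rewards_chosen: torch.Tensor,
                      rewards_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that y_w is preferred over y_l.

    rewards_chosen:   r_phi(x, y_w) for a batch, shape (batch_size,)
    rewards_rejected: r_phi(x, y_l) for the same batch, shape (batch_size,)
    """
    # sigma(r_w - r_l) is the modeled probability that y_w beats y_l;
    # logsigmoid computes log(sigma(.)) in a numerically stable way,
    # and the batch mean approximates the expectation over D.
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Hypothetical rewards for three (x, y_w, y_l) tuples
r_w = torch.tensor([1.2, 0.3, 2.0])  # r_phi(x, y_w)
r_l = torch.tensor([0.4, 0.5, 1.1])  # r_phi(x, y_l)
print(reward_model_loss(r_w, r_l).item())
```

Using `logsigmoid` rather than `torch.log(torch.sigmoid(...))` avoids numerical underflow when the reward difference is a large negative number, which is why the loss is usually written this way in practice.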