
@tlkahn
Last active January 30, 2025 01:57
reward model equation

The reward model $r_{\phi}$ is trained with the negative log-likelihood loss

$$\mathcal{L}_R(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_{\phi}(x, y_w) - r_{\phi}(x, y_l)\right)\right]$$

where:

  1. $r_{\phi}(x, y)$ is the reward model with parameters $\phi$.

  2. $(x, y_w, y_l)$ is a tuple from the dataset $\mathcal{D}$, where:

    • $x$ is the input
    • $y_w$ is the preferred output
    • $y_l$ is the less preferred output

  3. $\sigma(z) = \frac{1}{1+e^{-z}}$ is the logistic function.

  4. $r_{\phi}(x, y_w) - r_{\phi}(x, y_l)$ computes the difference in rewards.

  5. $\sigma(r_{\phi}(x, y_w) - r_{\phi}(x, y_l))$ represents the probability that $y_w$ is preferred over $y_l$.

  6. The log of this probability is taken, and the negative expectation over the dataset forms the loss function.

This loss encourages the model to assign higher rewards to preferred outputs.
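The loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training loop: the hypothetical `reward_model_loss` function assumes the rewards $r_{\phi}(x, y_w)$ and $r_{\phi}(x, y_l)$ have already been computed for each preference pair, and the expectation over $\mathcal{D}$ is approximated by a batch mean.

```python
import math

def logistic(z):
    """The logistic function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def reward_model_loss(pairs):
    """Negative log-likelihood loss over a batch of (r_w, r_l) pairs,
    where r_w = r_phi(x, y_w) and r_l = r_phi(x, y_l).

    -E[log sigma(r_w - r_l)]: the loss shrinks as the model assigns
    higher rewards to preferred outputs than to less preferred ones.
    """
    return -sum(math.log(logistic(r_w - r_l)) for r_w, r_l in pairs) / len(pairs)

# A pair where the preferred output already scores higher yields a
# smaller loss than one where the ordering is reversed:
good = reward_model_loss([(2.0, 0.5)])   # sigma(1.5) ~ 0.82, low loss
bad = reward_model_loss([(0.5, 2.0)])    # sigma(-1.5) ~ 0.18, high loss
```

In practice the subtraction is folded into a numerically stable primitive such as `logsigmoid` rather than computing `log(logistic(...))` directly, since `logistic` underflows for large negative arguments.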
