Deriving the DPO Loss from First Principles


In my previous post, I worked through the derivation of the PPO loss used in RLHF for LLMs. By the end, we arrived at a fairly daunting objective function with multiple components: a clipped surrogate, a value-function loss, an entropy bonus, and a KL penalty. And it is not just the final objective that is intimidating; the entire RLHF pipeline is complex and multi-step: you first train a separate reward model to reflect human preferences, then fine-tune the LLM with RL using PPO.

That brings us to Direct Preference Optimization (DPO). DPO is a computationally lightweight alternative that directly optimizes LLMs to adhere to human preferences without explicit reward modeling or reinforcement learning. The key insight is that DPO implicitly optimizes the same objective as PPO-based RLHF (reward maximization with a KL-divergence constraint), but it replaces the entire reward model + PPO loop with a single supervised objective on preference pairs. There is no sampling during training, no value function, no clipping, just a classification loss!

Here I derive the DPO loss, showing exactly how this simplification is possible. I will assume familiarity with concepts from the PPO post, particularly the reward model and the KL-constrained RLHF objective.

Again, a huge shoutout to Umar Jamil's video on DPO for an excellent walkthrough that helped me understand the derivation.

I: The RLHF Objective

Let's recall the RLHF objective from the PPO blog. The goal of RLHF is to find a policy $\pi_\theta$ that maximizes expected reward while staying close to a reference model $\pi_{\text{ref}}$:

$$ J_{\text{RLHF}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y|x)}\left[r_\phi(x, y)\right] - \beta \cdot D_{\text{KL}}\left(\pi_\theta(y|x) \| \pi_{\text{ref}}(y|x)\right) $$

The first term encourages the model to generate high-reward responses. The second term (the KL penalty) prevents the model from drifting too far from the reference, which helps avoid reward hacking and maintains language quality.

As we saw in the PPO blog, we can't optimize this objective directly with gradient descent because the expectation $\mathbb{E}_{y \sim \pi_\theta(y|x)}[\cdot]$ requires sampling from the policy and sampling is non-differentiable. This is why we needed reinforcement learning algorithms like REINFORCE and PPO. They provide ways to estimate policy gradients without differentiating through the sampling process.

What if we could reformulate the problem so that we don't need to sample from the policy during training? This is exactly what DPO will achieve.

II: The Bradley-Terry Model for Preference Learning

We also need to understand the Bradley-Terry model used for reward model training in a bit more detail, with a focus on why it works the way it does.

Training a reward model requires human-labeled preference data that compares pairs of responses:

$$ \mathcal{D} = \left\{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\right\}_{i=1}^{N} $$

where:

  • $x$ is the prompt
  • $y_w$ is the preferred (winning) response
  • $y_l$ is the dispreferred (losing) response
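For concreteness, a single record in such a dataset might look like the following; the field names and text are purely hypothetical, chosen only to illustrate the $(x, y_w, y_l)$ structure:

```python
# One hypothetical preference record: a prompt x, a preferred response y_w,
# and a dispreferred response y_l as judged by a human labeler.
preference_example = {
    "prompt": "Explain what a partition function is.",                 # x
    "chosen": "A partition function is the normalizing sum over ...",  # y_w
    "rejected": "It's just some math thing, don't worry about it.",    # y_l
}
```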

The Bradley-Terry Probability Model

The Bradley-Terry model provides a principled way to convert comparison data (defined above) into a probabilistic model. It assumes there exists some latent reward function $r^*(x, y)$ that captures true response quality, and models the probability that response $y_w$ is preferred over $y_l$ as:

$$ P(y_w \succ y_l | x) = \frac{e^{r^*(x, y_w)}}{e^{r^*(x, y_w)} + e^{r^*(x, y_l)}} \tag{II.I} $$

The intuition is straightforward: responses with higher reward are exponentially more likely to be preferred.

A key step is recognizing that this ratio of exponentials can be written as a sigmoid function. This is important because it connects Bradley-Terry to standard binary classification.

Let $A = r^*(x, y_w)$ and $B = r^*(x, y_l)$. We want to show:

$$ \frac{e^A}{e^A + e^B} = \sigma(A - B) $$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.

Starting with the left-hand side:

$$ \frac{e^A}{e^A + e^B} $$

we can rewrite it as:

$$ = \frac{e^A / e^A}{(e^A + e^B) / e^A} = \frac{1}{1 + e^{B}/e^{A}} = \frac{1}{1 + e^{B-A}} $$

which is the sigmoid function of $(A - B)$:

$$ = \frac{1}{1 + e^{-(A-B)}} = \sigma(A - B) $$

Therefore, the Bradley-Terry model can be written as:

$$ \boxed{P(y_w \succ y_l | x) = \sigma\left(r^*(x, y_w) - r^*(x, y_l)\right)} \tag{II.II} $$
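As a quick numerical sanity check, here is a tiny Python snippet (with made-up reward values for $A$ and $B$) confirming that the ratio-of-exponentials form and the sigmoid-of-the-difference form agree:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical rewards: A = r*(x, y_w), B = r*(x, y_l)
A, B = 2.3, 0.7

bradley_terry = math.exp(A) / (math.exp(A) + math.exp(B))  # equation (II.I)
sigmoid_form = sigmoid(A - B)                              # equation (II.II)

print(bradley_terry, sigmoid_form)  # both ~0.832, equal up to floating-point error
```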

Reward Model Loss

Given a dataset of preferences $\mathcal{D}$, we can train a parameterized reward model $r_\phi(x, y)$ using maximum likelihood estimation. We want to maximize the probability of observing the preferences in our dataset:

$$ \max_\phi \prod_{(x, y_w, y_l) \in \mathcal{D}} P(y_w \succ y_l | x) $$

Taking the log and negating (to turn maximization into minimization), we get the negative log-likelihood loss:

$$ \boxed{\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]} \tag{II.III} $$

This is just a binary cross-entropy objective that teaches the reward model to assign higher rewards to preferred responses. Notice that the Bradley-Terry model depends only on the difference of rewards: $r(x, y_w) - r(x, y_l)$. The absolute values don't matter; only their relative ordering does. This means:

If we add to the reward any constant $c$ or any function $f(x)$ that depends only on the prompt (not the response), the preference probabilities don't change. This invariance property will be the key to deriving DPO.
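In code, the reward model loss (II.III) is just a logistic loss on the difference of two scalar scores. Below is a minimal PyTorch-style sketch; `reward_model` is a stand-in for any network that maps a (prompt, response) pair to a scalar, and the batching details are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, x, y_w, y_l):
    """Bradley-Terry negative log-likelihood loss for a reward model.

    reward_model(x, y) is assumed to return one scalar score per example, shape (batch,).
    """
    r_w = reward_model(x, y_w)  # rewards for the preferred responses
    r_l = reward_model(x, y_l)  # rewards for the dispreferred responses

    # L_RM = -E[ log sigmoid(r_w - r_l) ], averaged over the batch
    return -F.logsigmoid(r_w - r_l).mean()
```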

III: Optimal Policy in Closed Form

Here, we will find the exact analytical solution to this optimization problem: the optimal policy in closed form. We want to find the policy that maximizes expected reward while keeping the KL divergence from the reference policy bounded:

$$ \max_\pi \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi(y|x)}\left[r(x, y)\right] - \beta \cdot D_{\text{KL}}\left(\pi(y|x) \| \pi_{\text{ref}}(y|x)\right) \tag{III.I} $$

Note, I am writing $\pi$ instead of $\pi_\theta$ to emphasize that we are looking for the optimal policy in general, not just the parameterized version.

Expanding the KL divergence:

$$ D_{\text{KL}}\left(\pi(y|x) \| \pi_{\text{ref}}(y|x)\right) = \mathbb{E}_{y \sim \pi(y|x)}\left[\log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}\right] $$

So our objective becomes:

$$ \max_\pi \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi(y|x)}\left[r(x, y) - \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}\right] $$

For a fixed prompt $x$, we want to find the distribution $\pi(\cdot|x)$ that maximizes:

$$ \mathbb{E}_{y \sim \pi(y|x)}\left[r(x, y) - \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}\right] $$

This is a constrained optimization problem over probability distributions. We can solve it using the method of Lagrange multipliers, enforcing that $\pi(y|x)$ sums to 1. For discrete $y$:

$$ \mathcal{L}(\pi, \lambda) = \sum_y \pi(y|x) \left[r(x, y) - \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}\right] + \lambda \left(1 - \sum_y \pi(y|x)\right) $$

Taking the derivative with respect to $\pi(y|x)$ and setting it to zero (stationary point):

$$ \frac{\partial \mathcal{L}}{\partial \pi(y|x)} = r(x, y) - \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} - \beta - \lambda = 0 $$

Now, solving for $\pi(y|x)$:

$$ \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} = \frac{1}{\beta}\left(r(x, y) - \beta - \lambda\right) $$

$$ \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} = \exp\left(\frac{1}{\beta}r(x, y)\right) \cdot \exp\left(-1 - \frac{\lambda}{\beta}\right) $$

$$ \pi(y|x) = \pi_{\text{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta}r(x, y)\right) \cdot \exp\left(-1 - \frac{\lambda}{\beta}\right) $$

The term $\exp\left(-1 - \frac{\lambda}{\beta}\right)$ is a constant (with respect to $y$) that ensures normalization. To find its value, we enforce that $\pi(y|x)$ must be a valid probability distribution and sum to 1:

$$ \sum_y \pi(y|x) = 1 $$

Substituting our expression for $\pi(y|x)$:

$$ \sum_y \pi_{\text{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta}r(x, y)\right) \cdot \exp\left(-1 - \frac{\lambda}{\beta}\right) = 1 $$

Since $\exp\left(-1 - \frac{\lambda}{\beta}\right)$ doesn't depend on $y$, we can factor it out of the sum:

$$ \exp\left(-1 - \frac{\lambda}{\beta}\right) \cdot \sum_y \pi_{\text{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta}r(x, y)\right) = 1 $$

Solving for the constant:

$$ \exp\left(-1 - \frac{\lambda}{\beta}\right) = \frac{1}{\sum_y \pi_{\text{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta}r(x, y)\right)} $$

We define this normalizing sum as the partition function $Z(x)$:

$$ Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta}r(x, y)\right) \tag{III.II} $$

Substituting back, we get the optimal policy:

$$ \boxed{\pi_r(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta}r(x, y)\right)} \tag{III.III} $$

We now have an exact closed-form expression for the optimal policy. However, we cannot compute it directly because $Z(x)$ is intractable: evaluating it requires summing over all possible responses $y$, which is not feasible.
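To make the closed form concrete, here is a small numerical sketch that treats $y$ as ranging over just a handful of candidate responses, so that $Z(x)$ can actually be summed. The candidate set, reference probabilities, and rewards below are made up for illustration; with a real LLM, the sum over all possible token sequences is exactly what makes $Z(x)$ intractable:

```python
import numpy as np

beta = 0.1

# Toy setup: 4 candidate responses for one fixed prompt x (hypothetical numbers)
pi_ref = np.array([0.40, 0.30, 0.20, 0.10])  # reference policy over the candidates
rewards = np.array([1.0, 2.0, 0.5, 1.5])     # r(x, y) for each candidate

# Partition function (III.II): Z(x) = sum_y pi_ref(y|x) * exp(r(x, y) / beta)
Z = np.sum(pi_ref * np.exp(rewards / beta))

# Optimal policy (III.III): pi_r(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x)
pi_opt = pi_ref * np.exp(rewards / beta) / Z

print(pi_opt, pi_opt.sum())  # probability mass shifts toward high-reward responses; sums to 1
```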

IV: The Reparameterization Trick

The key insight of DPO is to flip the relationship between reward and policy. Now, we frame the problem as: "given an optimal policy, what reward function does it correspond to?"

Starting from the optimal policy equation (III.III):

$$ \pi_r(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta}r(x, y)\right) $$

We solve for the reward $r(x, y)$ by first taking the log of both sides:

$$ \log \pi_r(y|x) = \log \pi_{\text{ref}}(y|x) + \frac{1}{\beta}r(x, y) - \log Z(x) $$

Now rearrange to get $r(x, y)$ on the left-hand side:

$$ \frac{1}{\beta}r(x, y) = \log \pi_r(y|x) - \log \pi_{\text{ref}}(y|x) + \log Z(x) $$

$$ r(x, y) = \beta \log \pi_r(y|x) - \beta \log \pi_{\text{ref}}(y|x) + \beta \log Z(x) $$

This can be written more compactly as:

$$ \boxed{r(x, y) = \beta \log \frac{\pi_r(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)} \tag{IV.I} $$

The reward is expressed as:

  • A term involving the log-ratio of the optimal policy to the reference policy
  • $\beta \log Z(x)$, which depends only on $x$ (not on $y$)
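Continuing the toy example from Section III (same hypothetical numbers), we can verify numerically that plugging the optimal policy back into (IV.I) recovers the original rewards:

```python
import numpy as np

beta = 0.1
pi_ref = np.array([0.40, 0.30, 0.20, 0.10])  # same hypothetical reference policy as before
rewards = np.array([1.0, 2.0, 0.5, 1.5])     # same hypothetical rewards as before

Z = np.sum(pi_ref * np.exp(rewards / beta))   # partition function (III.II)
pi_opt = pi_ref * np.exp(rewards / beta) / Z  # optimal policy (III.III)

# Reparameterization (IV.I): r(x, y) = beta * log(pi_opt / pi_ref) + beta * log Z(x)
recovered = beta * np.log(pi_opt / pi_ref) + beta * np.log(Z)

print(np.allclose(recovered, rewards))  # True: the log-ratio plus beta*log Z recovers the reward
```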

V: Deriving the DPO Loss

Finally, we have all the pieces to derive the DPO loss. From Section II, the Bradley-Terry preference model is:

$$ P(y_w \succ y_l | x) = \sigma\left(r(x, y_w) - r(x, y_l)\right) $$

From Section IV, assuming we have access to an optimal policy $\pi^*$, the reward can be written as:

$$ r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x) $$

Substituting this into Bradley-Terry:

$$ P(y_w \succ y_l | x) = \sigma\left(\left[\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x)\right] - \left[\beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} + \beta \log Z(x)\right]\right) $$

Simplifying the expression inside the sigmoid:

$$ = \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \beta \log Z(x)\right) $$

The $\beta \log Z(x)$ terms cancel:

$$ = \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) $$

Now recall the critical insight from Section II: the Bradley-Terry model depends only on reward differences. Thus, when we compute $r(x, y_w) - r(x, y_l)$, the intractable partition function $Z(x)$ cancels out. This is what makes DPO possible.

We can write this more cleanly by defining the implicit reward in terms of the optimal policy:

$$ \hat{r}(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} $$

Thus:

$$ P(y_w \succ y_l | x) = \sigma\left(\hat{r}(x, y_w) - \hat{r}(x, y_l)\right) $$

We don't actually have access to the optimal policy $\pi^*$. But we can parameterize a policy $\pi_\theta$ and optimize it to maximize the likelihood of the observed preferences. This is exactly what the reward model loss (II.III) does, except that now our reward is implicitly defined by the policy itself.

The DPO loss is the negative log-likelihood:

$$ \boxed{\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]} \tag{V.I} $$

In implicit reward notation, it can be written as:

$$ \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\right)\right] \tag{V.II} $$

Some key insights from the above DPO loss:

  • The policy implicitly defines its own reward via the log-ratio with the reference. There is no separate reward model.
  • This is just a supervised classification loss on preference pairs with no RL.
  • DPO uses the fixed preference dataset $\mathcal{D}$, so there is no sampling during training.
  • No value function is needed since we are not doing policy gradients.
  • DPO still optimizes the KL-constrained reward maximization objective, just in a different way.
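To connect the loss to code: given the summed log probabilities of the chosen and rejected responses under both the policy and the frozen reference model (how to compute them is the topic of Section VII), the DPO loss is only a few lines of PyTorch. This is a minimal sketch under those assumptions, not the reference implementation from the DPO paper or any particular library:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from per-example sequence log probabilities, each of shape (batch,).

    policy_logp_w = log pi_theta(y_w|x)   ref_logp_w = log pi_ref(y_w|x)
    policy_logp_l = log pi_theta(y_l|x)   ref_logp_l = log pi_ref(y_l|x)
    """
    # Implicit rewards: r_hat = beta * log(pi_theta / pi_ref)
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)

    # L_DPO = -E[ log sigmoid(r_hat_w - r_hat_l) ]
    return -F.logsigmoid(reward_w - reward_l).mean()
```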

VI: Building Intuition for DPO

Now that we have the DPO loss, we can build some intuition around the implicit reward model and its gradient updates.

Implicit Reward Model

The DPO paper's subtitle is "Your Language Model is Secretly a Reward Model", and this captures the key insight: the LLM itself serves as the implicit reward model. The policy $\pi_\theta$ defines an implicit reward function:

$$ \hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} $$

This reward measures how much more likely the current policy is to generate response $y$ compared to the reference policy, scaled by $\beta$.

  • If $\pi_\theta(y|x) > \pi_{\text{ref}}(y|x)$: the implicit reward is positive (the policy "likes" this response more than the reference)
  • If $\pi_\theta(y|x) < \pi_{\text{ref}}(y|x)$: the implicit reward is negative (the policy "likes" this response less than the reference)

Analyzing the Gradient Update

We can flex our brain muscles one more time and compute the gradient for the DPO objective:

$$ \mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\left(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\right)\right] $$

Let $u = \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)$. Using the chain rule:

$$ \nabla_\theta \mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\frac{\sigma'(u)}{\sigma(u)} \nabla_\theta u\right] $$

Using the property $\sigma'(u) = \sigma(u)(1 - \sigma(u))$:

$$ = -\mathbb{E}\left[\frac{\sigma(u)(1-\sigma(u))}{\sigma(u)} \nabla_\theta u\right] = -\mathbb{E}\left[(1 - \sigma(u)) \nabla_\theta u\right] $$

Using $1 - \sigma(u) = \sigma(-u)$:

$$ = -\mathbb{E}\left[\sigma(-u) \nabla_\theta u\right] $$

Now, $-u = \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)$, and:

$$ \nabla_\theta u = \nabla_\theta \hat{r}_\theta(x, y_w) - \nabla_\theta \hat{r}_\theta(x, y_l) = \beta \nabla_\theta \log \pi_\theta(y_w|x) - \beta \nabla_\theta \log \pi_\theta(y_l|x) $$

Putting it together:

$$ \boxed{\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta \mathbb{E}\left[\underbrace{\sigma\left(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\right)}_{\text{weight}}\left(\underbrace{\nabla_\theta \log \pi_\theta(y_w|x)}_{\text{increase } y_w} - \underbrace{\nabla_\theta \log \pi_\theta(y_l|x)}_{\text{decrease } y_l}\right)\right]} $$

  • $\nabla_\theta \log \pi_\theta(y_w|x)$ points in the direction that increases the probability of the preferred response
  • $-\nabla_\theta \log \pi_\theta(y_l|x)$ points in the direction that decreases the probability of the dispreferred response
  • The weight term is high when $\hat{r}_\theta(x, y_l) > \hat{r}_\theta(x, y_w)$, i.e. when the model currently assigns a higher implicit reward to the losing response than to the winning response. In other words:
    • When the model is wrong (ranks $y_l$ above $y_w$), we get large gradient updates
    • When the model is right (ranks $y_w$ above $y_l$), we get small gradient updates

This dynamic sigmoid weighting is crucial. It naturally focuses learning on the examples the model currently gets wrong.
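A quick numerical illustration of that weighting (the implicit reward values are made up): the more the model's implicit ranking disagrees with the human preference, the closer the weight is to 1, and the more it already agrees, the closer the weight is to 0.

```python
import torch

def dpo_gradient_weight(reward_w, reward_l):
    # Weight term from the gradient: sigma(r_hat(x, y_l) - r_hat(x, y_w))
    return torch.sigmoid(torch.tensor(reward_l - reward_w))

print(dpo_gradient_weight(reward_w=-2.0, reward_l=2.0))  # ~0.98: model strongly wrong -> large update
print(dpo_gradient_weight(reward_w=0.1, reward_l=-0.1))  # ~0.45: nearly tied -> moderate update
print(dpo_gradient_weight(reward_w=3.0, reward_l=-3.0))  # ~0.002: model already right -> tiny update
```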

VII: Computing Log Probabilities in Practice

This section is fully adapted from Umar Jamil's video. I think it is essential to understand how log probabilities are computed in practice.

The DPO loss requires computing $\log \pi_\theta(y|x)$, the log probability of a complete response $y$ given a prompt $x$. Let's see how this works in practice with LLMs.

Language models are autoregressive: they generate text one token at a time, conditioning on all previous tokens. For a response $y = (y_1, y_2, \ldots, y_T)$, the probability factorizes as:

$$ \pi_\theta(y|x) = \prod_{t=1}^{T} \pi_\theta(y_t | x, y_1, \ldots, y_{t-1}) = \prod_{t=1}^{T} \pi_\theta(y_t | x, y_{<t}) $$

Taking the logarithm:

$$ \boxed{\log \pi_\theta(y|x) = \sum_{t=1}^{T} \log \pi_\theta(y_t | x, y_{<t})} \tag{VII.I} $$

The log probability of the full response is the sum of log probabilities at each position.

Here's how to compute $\log \pi_\theta(y|x)$:

  1. Prepare input: Concatenate the prompt and response into a single sequence

    $\text{input} = [x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_T]$

  2. Forward pass: Run the transformer to get hidden states at each position

  3. Project to logits: Apply the language model head (typically a linear layer) to get vocabulary logits at each position

  4. Log softmax: Convert logits to log probabilities over the vocabulary using a log-softmax

  5. Gather relevant log probs: For each position $t$ in the response, extract the log probability of the actual next token $y_t$ (since we know the output)

  6. Sum with masking: Sum the log probabilities but only for response tokens (not prompt tokens)

    $\log \pi_\theta(y|x) = \sum_{t \in \text{response positions}} \log \pi_\theta(y_t | x, y_{<t})$

This gives us $\log \pi_\theta(y|x)$ for one response. We do this for both the preferred response $y_w$ and the dispreferred response $y_l$, and we do it for both the policy model $\pi_\theta$ and the frozen reference model $\pi_{\text{ref}}$. With these four log probabilities in hand, we can compute the DPO loss.
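Here is a minimal sketch of steps 1 to 6 for a Hugging Face-style causal language model. The exact tensor shapes and the `loss_mask` convention (1 at response positions, 0 at prompt and padding positions) are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, loss_mask):
    """Sum of log pi(y_t | x, y_<t) over response tokens only.

    input_ids: (batch, seq_len) token ids of the prompt concatenated with the response
    loss_mask: (batch, seq_len) 1.0 at response positions, 0.0 at prompt/padding positions
    """
    logits = model(input_ids).logits            # (batch, seq_len, vocab)
    log_probs = F.log_softmax(logits, dim=-1)   # log probabilities over the vocabulary

    # The logits at position t predict the token at position t+1, so shift by one
    target_ids = input_ids[:, 1:]               # tokens being scored
    log_probs = log_probs[:, :-1, :]            # predictions for those tokens
    mask = loss_mask[:, 1:]                     # keep only response tokens

    # Gather the log probability of the actual next token at each position
    token_logps = torch.gather(log_probs, dim=2, index=target_ids.unsqueeze(-1)).squeeze(-1)

    # Sum with masking: only response tokens contribute
    return (token_logps * mask).sum(dim=-1)     # shape (batch,)
```

Calling this for $(\pi_\theta, y_w)$, $(\pi_\theta, y_l)$, $(\pi_{\text{ref}}, y_w)$, and $(\pi_{\text{ref}}, y_l)$ (the reference model under `torch.no_grad()`) yields the four quantities that the `dpo_loss` sketch from Section V consumes.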

Conclusion

Once you derive the DPO loss, you start appreciating the simplicity and elegance of the solution, especially when compared to PPO. The derivation hinges on one observation: the Bradley-Terry model only cares about reward differences, which makes the intractable partition function from the analytical solution cancel out completely. What remains is a straightforward classification loss.
