In my previous post, I worked through the derivation of the PPO loss used in RLHF for LLMs. By the end, we arrived at a fairly daunting objective function with multiple components: the clipped surrogate, a value function loss, an entropy bonus, and a KL penalty. It is not just that the final objective is intimidating; the entire RLHF pipeline is complex and multi-step. You first train a separate reward model to reflect human preferences, then fine-tune the LLM using RL with PPO.
That brings us to Direct Preference Optimization (DPO). DPO is a computationally lightweight alternative that directly optimizes LLMs to adhere to human preferences without explicit reward modeling or reinforcement learning. The key insight is that DPO implicitly optimizes the same objective as PPO-based RLHF (reward maximization with a KL-divergence constraint) but it replaces the entire reward model + PPO loop with a single supervised objective on preference pairs. There is no sampling during training, no value function, no clipping, just a classification loss!
Here I derive the DPO loss showing exactly how this simplification is possible. I will assume familiarity with concepts from the PPO post, particularly the reward model and the KL-constrained RLHF objective.
Again, a huge shoutout to Umar Jamil's video on DPO for an excellent walkthrough that helped me understand the derivation.
I: The RLHF Objective
Let's recall the RLHF objective from the PPO blog. The goal of RLHF is to find a policy $\pi_\theta$ that maximizes expected reward while staying close to a reference model $\pi_{\mathrm{ref}}$:
$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y|x)}\big[r(x,y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y|x) \,\|\, \pi_{\mathrm{ref}}(y|x)\big]$$
The first term encourages the model to generate high-reward responses. The second term (KL penalty) prevents the model from drifting too far from the reference, which helps avoid reward hacking and maintains language quality.
As we saw in the PPO blog, we can't optimize this objective directly with gradient descent because the expectation $\mathbb{E}_{y \sim \pi_\theta(y|x)}[\cdot]$ requires sampling from the policy, and sampling is non-differentiable. This is why we needed reinforcement learning algorithms like REINFORCE and PPO. They provide ways to estimate policy gradients without differentiating through the sampling process.
What if we could reformulate the problem so that we don't need to sample from the policy during training? This is exactly what DPO will achieve.
II: The Bradley-Terry Model for Preference Learning
We also need to understand the Bradley-Terry model used for reward model training in a bit more detail, with a focus on why it works the way it does.
Training a reward model requires human-labeled preference data that compares pairs of responses.
$$\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^{N}$$
where:
$x$ is the prompt
$y_w$ is the preferred (winning) response
$y_l$ is the dispreferred (losing) response
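For illustration, a preference dataset of this shape might be stored as a list of triples. This is a toy sketch of my own (the field names are arbitrary), not a format used in the post:

```python
# Each entry is one (x, y_w, y_l) triple from the preference dataset D
preference_data = [
    {
        "prompt": "Explain KL divergence in one sentence.",
        "chosen": "KL divergence measures how much one probability distribution differs from another.",
        "rejected": "It's a kind of distance, I think.",
    },
    # ... more (prompt, chosen, rejected) triples collected from human annotators
]
```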
The Bradley-Terry Probability Model
The Bradley-Terry model provides a principled way to convert comparison data (defined above) into a probabilistic model. It assumes there exists some latent reward function $r^*(x,y)$ that captures true response quality and models the probability that response $y_w$ is preferred over $y_l$ as:
$$P(y_w \succ y_l \mid x) = \frac{\exp\big(r^*(x, y_w)\big)}{\exp\big(r^*(x, y_w)\big) + \exp\big(r^*(x, y_l)\big)} \tag{II.I}$$
The intuition is straightforward: responses with a higher reward are exponentially more likely to be preferred.
A key step is recognizing that this ratio of exponentials can be written as a sigmoid function. This is important because it connects Bradley-Terry to standard binary classification.
Let $A = r(x, y_w)$ and $B = r(x, y_l)$. We want to show:
$$\frac{e^A}{e^A + e^B} = \sigma(A - B)$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
Starting with the left-hand side:
$$\frac{e^A}{e^A + e^B}$$
we can rewrite it as:
$$= \frac{e^A / e^A}{(e^A + e^B)/e^A} = \frac{1}{1 + e^{B}/e^{A}} = \frac{1}{1 + e^{B-A}}$$
which is the sigmoid function of $(A - B)$:
$$= \frac{1}{1 + e^{-(A-B)}} = \sigma(A - B)$$
Therefore, the Bradley-Terry model can be written as:
$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) \tag{II.II}$$
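As a quick numeric sanity check of this identity (my own illustration, using arbitrary reward values):

```python
import torch

A = torch.tensor(1.7)   # r(x, y_w), arbitrary
B = torch.tensor(-0.4)  # r(x, y_l), arbitrary

lhs = torch.exp(A) / (torch.exp(A) + torch.exp(B))  # Bradley-Terry form
rhs = torch.sigmoid(A - B)                          # sigmoid of the reward difference
print(torch.allclose(lhs, rhs))  # True
```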
Reward Model Loss
Given a dataset of preferences $\mathcal{D}$, we can train a parameterized reward model $r_\phi(x,y)$ using maximum likelihood estimation. We want to maximize the probability of observing the preferences in our dataset:
$$\max_{\phi} \prod_{(x, y_w, y_l) \in \mathcal{D}} P(y_w \succ y_l \mid x)$$
Taking the log and negating (to turn maximization into minimization), we get the negative log-likelihood loss:
$$\mathcal{L}_R(r_\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big] \tag{II.III}$$
This is just a binary cross-entropy objective that teaches the reward model to assign higher rewards to preferred responses. Notice that the Bradley-Terry model depends only on the difference of rewards: $r(x, y_w) - r(x, y_l)$. The absolute values don't matter, only their relative ordering. This means:
If we add to the reward any constant $c$, or any function $f(x)$ that depends only on the prompt (not the response), the preference probabilities don't change. This invariance property will be the key to deriving DPO.
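As a concrete illustration, here is a minimal PyTorch sketch of this pairwise loss. It assumes the scalar rewards $r_\phi(x, y_w)$ and $r_\phi(x, y_l)$ have already been produced by some reward model; the function name is mine, not from the post:

```python
import torch.nn.functional as F

def bradley_terry_loss(rewards_chosen, rewards_rejected):
    """Negative log-likelihood of the Bradley-Terry model (eq. II.III).

    rewards_chosen / rewards_rejected: tensors of shape (batch,) holding
    r_phi(x, y_w) and r_phi(x, y_l) for each preference pair.
    """
    # -log sigma(r_w - r_l): pushes the reward of the winner above the loser
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
```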
III: Optimal Policy in Closed Form
Here, we will find the exact analytical solution to this optimization problem, i.e. the optimal policy. We want to find the policy that maximizes expected reward while keeping the KL divergence from the reference policy bounded:
$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(y|x)}\big[r(x,y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi(y|x) \,\|\, \pi_{\mathrm{ref}}(y|x)\big]$$
For a fixed prompt x, we want to find the distribution π(⋅∣x) that maximizes:
$$\mathbb{E}_{y \sim \pi(y|x)}\left[r(x,y) - \beta \log \frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right]$$
This is a constrained optimization problem over probability distributions. We can solve it using the method of Lagrange multipliers, enforcing that $\pi(y|x)$ sums to 1. For discrete $y$, the Lagrangian is:
$$\mathcal{L} = \sum_{y} \pi(y|x)\left[r(x,y) - \beta \log \frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right] - \lambda\left(\sum_{y} \pi(y|x) - 1\right)$$
Taking the derivative with respect to $\pi(y|x)$ and setting it to zero (stationary point):
$$\frac{\partial \mathcal{L}}{\partial \pi(y|x)} = r(x,y) - \beta \log \frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} - \beta - \lambda = 0$$
Now, solving for $\pi(y|x)$:
$$\log \frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} = \frac{1}{\beta}\big(r(x,y) - \beta - \lambda\big)$$
$$\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} = \exp\left(\frac{1}{\beta} r(x,y)\right) \cdot \exp\left(-1 - \frac{\lambda}{\beta}\right)$$
$$\pi(y|x) = \pi_{\mathrm{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta} r(x,y)\right) \cdot \exp\left(-1 - \frac{\lambda}{\beta}\right)$$
The term $\exp(-1 - \lambda/\beta)$ is a constant (with respect to $y$) that ensures normalization. To find its value, we enforce that $\pi(y|x)$ must be a valid probability distribution and sum to 1:
$$\sum_{y} \pi(y|x) = 1$$
Substituting our expression for $\pi(y|x)$:
$$\sum_{y} \pi_{\mathrm{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta} r(x,y)\right) \cdot \exp\left(-1 - \frac{\lambda}{\beta}\right) = 1$$
Since $\exp(-1 - \lambda/\beta)$ doesn't depend on $y$, we can factor it out of the sum:
$$\exp\left(-1 - \frac{\lambda}{\beta}\right) \cdot \sum_{y} \pi_{\mathrm{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta} r(x,y)\right) = 1$$
Solving for the constant:
$$\exp\left(-1 - \frac{\lambda}{\beta}\right) = \frac{1}{\sum_{y} \pi_{\mathrm{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta} r(x,y)\right)}$$
We define this normalizing sum as the partition function $Z(x)$:
$$Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right) \tag{III.II}$$
Substituting back, we get the optimal policy:
$$\pi_r(y|x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right) \tag{III.III}$$
We have an exact closed-form expression for the optimal policy. However, we cannot compute it directly because $Z(x)$ is intractable: evaluating it requires summing over all possible responses $y$, which is not feasible.
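To see equation (III.III) in action, here is a toy sketch of my own (not from the post) in which the response space has only four candidates, so $Z(x)$ can actually be computed:

```python
import torch

beta = 0.1
pi_ref = torch.tensor([0.40, 0.30, 0.20, 0.10])  # reference probabilities over 4 candidate responses
reward = torch.tensor([1.0, 2.0, 0.5, 3.0])      # r(x, y) for each candidate

# Unnormalized optimal policy: pi_ref(y|x) * exp(r(x, y) / beta)
unnormalized = pi_ref * torch.exp(reward / beta)

# Partition function Z(x): only tractable here because the response space is tiny
Z = unnormalized.sum()
pi_star = unnormalized / Z

print(pi_star)  # probability mass concentrates on the highest-reward response for small beta
```

For a real LLM the sum would run over every possible token sequence, which is exactly why $Z(x)$ is intractable.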
IV: The Reparameterization Trick
The key insight of DPO is to flip the relationship between reward and policy. Now we frame the problem as: "given an optimal policy, what reward function does it correspond to?"
Starting from the optimal policy equation (III.III):
$$\pi_r(y|x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right)$$
We solve for the reward r(x,y) by first taking the log of both sides:
$$\log \pi_r(y|x) = \log \pi_{\mathrm{ref}}(y|x) + \frac{1}{\beta} r(x,y) - \log Z(x)$$
Now rearrange to get $r(x,y)$ on the left-hand side:
$$\frac{1}{\beta} r(x,y) = \log \pi_r(y|x) - \log \pi_{\mathrm{ref}}(y|x) + \log Z(x)$$
$$r(x,y) = \beta \log \pi_r(y|x) - \beta \log \pi_{\mathrm{ref}}(y|x) + \beta \log Z(x)$$
This can be written more compactly as:
$$r(x,y) = \beta \log \frac{\pi_r(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x) \tag{IV.I}$$
The reward is expressed as:
A term involving the log-ratio of the optimal policy to the reference policy
$\beta \log Z(x)$, which depends only on $x$ (not on $y$)
V: Deriving the DPO Loss
Finally, we have all the pieces to derive the DPO loss. From Section II, the Bradley-Terry preference model is:
$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$
From Section IV, assuming we have access to the optimal policy $\pi^*$, the reward can be written as:
$$r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x)$$
Now recall the critical insight from Section II: the Bradley-Terry model depends only on reward differences. Substituting the expression above into the difference $r(x, y_w) - r(x, y_l)$:
$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)} - \beta \log Z(x)$$
The intractable partition function $Z(x)$ cancels out. This is what makes DPO possible.
We can write this more cleanly by defining the implicit reward in terms of the optimal policy:
$$\hat{r}(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)}$$
Thus:
$$P(y_w \succ y_l \mid x) = \sigma\big(\hat{r}(x, y_w) - \hat{r}(x, y_l)\big)$$
We don't actually have access to the optimal policy $\pi^*$. But we can parameterize a policy $\pi_\theta$ and optimize it to maximize the likelihood of the observed preferences. This is exactly what the reward model loss (II.III) does, except now our reward is implicitly defined by the policy itself. Plugging the implicit reward into (II.III) gives the DPO loss:
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)\right]$$
The policy implicitly defines its own reward via the log-ratio with the reference. There is no separate reward model.
This is just a supervised classification loss on preference pairs with no RL.
DPO uses the fixed preference dataset D. Thus, no sampling during training.
No value function needed since we're not doing policy gradients
DPO still optimizes the KL-constrained reward maximization objective but in a different way
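To make the objective concrete, here is a minimal PyTorch sketch of the DPO loss. It assumes the four sequence-level log probabilities (policy and reference, for the chosen and rejected responses) have already been computed, as described in Section VII; the function name and the default $\beta$ are illustrative choices of mine:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss from sequence-level log probabilities, each of shape (batch,)."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    chosen_rewards = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (policy_logp_rejected - ref_logp_rejected)

    # -log sigma(r_hat_w - r_hat_l), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```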
VI: Building Intuition for DPO
Now that we have the DPO loss, we can build some intuition around the implicit reward model and its gradient updates.
Implicit Reward Model
The DPO paper's subtitle is "Your Language Model is Secretly a Reward Model", and it captures the key insight: the LLM itself serves as the implicit reward model. The policy $\pi_\theta$ defines an implicit reward function:
$$\hat{r}_\theta(x,y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)}$$
This reward measures how much more likely the current policy is to generate response y compared to the reference policy, scaled by β.
If $\pi_\theta(y|x) > \pi_{\mathrm{ref}}(y|x)$: the implicit reward is positive (the policy "likes" this response more than the reference)
If $\pi_\theta(y|x) < \pi_{\mathrm{ref}}(y|x)$: the implicit reward is negative (the policy "likes" this response less than the reference)
Analyzing the Gradient Update
We can flex our brain muscles one more time and compute the gradient for the DPO objective:
$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\big[\log \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\big)\big]$$
Let $u = \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)$. Using the chain rule (and noting that $\pi_{\mathrm{ref}}$ is frozen):
$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\Big[\sigma(-u)\,\beta\big(\nabla_\theta \log \pi_\theta(y_w|x) - \nabla_\theta \log \pi_\theta(y_l|x)\big)\Big]$$
Each piece has a clear interpretation:
$\nabla_\theta \log \pi_\theta(y_w|x)$ points in the direction that increases the probability of the preferred response
$-\nabla_\theta \log \pi_\theta(y_l|x)$ points in the direction that decreases the probability of the dispreferred response
The weight term $\sigma(-u) = \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$ is high when the model currently assigns a higher implicit reward to the losing response than to the winning response. In other words:
When the model is wrong (ranks yl above yw), we get large gradient updates
When the model is right (ranks yw above yl), we get small gradient updates
This dynamic sigmoid weighting is crucial. It naturally focuses learning on the examples the model currently gets wrong.
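A tiny illustration of this weighting, with hypothetical margin values of my own:

```python
import torch

# u = r_hat(x, y_w) - r_hat(x, y_l): the implicit reward margin
u_model_wrong = torch.tensor(-2.0)  # model currently prefers the losing response
u_model_right = torch.tensor(2.0)   # model already prefers the winning response

print(torch.sigmoid(-u_model_wrong))  # ~0.88 -> large gradient weight
print(torch.sigmoid(-u_model_right))  # ~0.12 -> small gradient weight
```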
VII: Computing Log Probabilities in Practice
This section is fully adapted from Umar Jamil's video. I think it is essential to understand how log probabilities are computed in practice.
The DPO loss requires computing $\log \pi_\theta(y|x)$, the log probability of a complete response $y$ given a prompt $x$. Let's see how this works in practice with LLMs.
Language models are autoregressive: they generate text one token at a time, conditioning on all previous tokens. For a response $y = (y_1, y_2, \ldots, y_T)$, the probability factorizes as:
$$\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t})$$
The log probability of the full response is the sum of log probabilities at each position.
Here's how to compute $\log \pi_\theta(y|x)$:
Prepare input: Concatenate the prompt and response into a single sequence
$$\text{input} = [x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_T]$$
Forward pass: Run the transformer to get hidden states at each position
Project to logits: Apply the language model head (typically a linear layer) to get vocabulary logits at each position
Log softmax: Convert logits to log probabilities over the vocabulary using logsoftmax
Gather relevant log probs: For each position $t$ in the response, extract the log probability of the actual next token $y_t$ (since we know the output)
Sum with masking: Sum the log probabilities but only for response tokens (not prompt tokens)
$$\log \pi_\theta(y \mid x) = \sum_{t \,\in\, \text{response positions}} \log \pi_\theta(y_t \mid x, y_{<t})$$
This gives us $\log \pi_\theta(y|x)$ for one response. We do this for both the preferred response $y_w$ and the dispreferred response $y_l$, and we also do it for both the policy model $\pi_\theta$ and the frozen reference model $\pi_{\mathrm{ref}}$. With these four log probabilities in hand, we can compute the DPO loss.
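Here is a sketch of this procedure in PyTorch. It assumes a Hugging Face-style causal LM whose output exposes `.logits` of shape `(batch, seq_len, vocab)`; the argument names and the `response_mask` convention are my own, not from the video:

```python
import torch.nn.functional as F

def sequence_logprob(model, input_ids, attention_mask, response_mask):
    """Sum of log-probabilities of the response tokens under `model`.

    input_ids:     (batch, seq_len) prompt tokens followed by response tokens
    response_mask: (batch, seq_len) 1.0 for response tokens, 0.0 for prompt/padding
    """
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits

    # The token at position t is predicted by the logits at position t-1, so shift by one
    logits = logits[:, :-1, :]
    labels = input_ids[:, 1:]
    mask = response_mask[:, 1:]

    # Log-softmax over the vocabulary, then gather the log-prob of each actual token
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)

    # Keep only response positions and sum per sequence
    return (token_logp * mask).sum(dim=-1)
```

Calling this four times (policy and reference model, chosen and rejected responses, with `torch.no_grad()` around the reference passes) produces exactly the inputs to the DPO loss sketched in Section V.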
Conclusion
Once you derive the DPO loss, you start appreciating the simplicity and elegance of the solution, especially when compared to PPO. The derivation hinges on one observation: the Bradley-Terry model only cares about reward differences, which causes the intractable partition function in the analytical solution to cancel out completely. What remains is a straightforward classification loss.