In my previous post, I worked through the derivation of the PPO loss used in RLHF for LLMs. By the end, we arrived at a fairly daunting objective function with multiple components: the clipped surrogate, a value function loss, an entropy bonus, and a KL penalty. It is not just that the final objective is intimidating; the entire RLHF pipeline is complex and multi-step. You first train a separate reward model to reflect human preferences, then fine-tune the LLM using RL with PPO.
That brings us to Direct Preference Optimization (DPO). DPO is a computationally lightweight alternative that directly optimizes LLMs to adhere to human preferences without explicit reward modeling or reinforcement learning. The key insight is that DPO implicitly optimizes the same objective as PPO-based RLHF (reward maximization with a KL-divergence constraint) but it replaces the entire reward model + PPO loop with a single supervised objective on preference pairs. There is no sampling during training, no value function, no clipping, just a classification loss!
Here I derive the DPO loss showing exactly how this simplification is possible. I will assume familiarity with concepts from the PPO post, particularly the reward model and the KL-constrained RLHF objective.
Again, a huge shoutout to Umar Jamil's video on DPO for an excellent walkthrough that helped me understand the derivation.
I: The RLHF Objective
Let's recall the RLHF objective from the PPO blog. The goal of RLHF is to find a policy $\pi_\theta$ that maximizes expected reward while staying close to a reference model $\pi_{\mathrm{ref}}$:
$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y|x)}\big[r(x,y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y|x) \,\|\, \pi_{\mathrm{ref}}(y|x)\big]$$
The first term encourages the model to generate high-reward responses. The second term (KL penalty) prevents the model from drifting too far from the reference, which helps avoid reward hacking and maintains language quality.
As we saw in the PPO blog, we can't optimize this objective directly with gradient descent because the expectation $\mathbb{E}_{y \sim \pi_\theta(y|x)}[\cdot]$ requires sampling from the policy, and sampling is non-differentiable. This is why we needed reinforcement learning algorithms like REINFORCE and PPO. They provide ways to estimate policy gradients without differentiating through the sampling process.
What if we could reformulate the problem so that we don't need to sample from the policy during training? This is exactly what DPO will achieve.
II: The Bradley-Terry Model for Preference Learning
We also need to understand the Bradley-Terry model used for reward model training in a bit more detail, with a focus on why it works the way it does.
Training a reward model requires human-labeled preference data that compares pairs of responses.
$$\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^{N}$$
where:
$x$ is the prompt
$y_w$ is the preferred (winning) response
$y_l$ is the dispreferred (losing) response
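For illustration, a preference dataset of this shape might be stored as a list of triples. This is a toy sketch of my own (the field names are arbitrary), not a format used in the post:

```python
# Each entry is one (x, y_w, y_l) triple from the preference dataset D
preference_data = [
    {
        "prompt": "Explain KL divergence in one sentence.",
        "chosen": "KL divergence measures how much one probability distribution differs from another.",
        "rejected": "It's a kind of distance, I think.",
    },
    # ... more (prompt, chosen, rejected) triples collected from human annotators
]
```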
The Bradley-Terry Probability Model
The Bradley-Terry model provides a principled way to convert comparison data (defined above) into a probabilistic model. It assumes there exists some latent reward function $r^*(x,y)$ that captures true response quality and models the probability that response $y_w$ is preferred over $y_l$ as:
$$P(y_w \succ y_l \mid x) = \frac{\exp\big(r^*(x, y_w)\big)}{\exp\big(r^*(x, y_w)\big) + \exp\big(r^*(x, y_l)\big)} \tag{II.I}$$
The intuition is straightforward: responses with a higher reward are exponentially more likely to be preferred.
A key step is recognizing that this ratio of exponentials can be written as a sigmoid function. This is important because it connects Bradley-Terry to standard binary classification.
Let $A = r(x, y_w)$ and $B = r(x, y_l)$. We want to show:
$$\frac{e^A}{e^A + e^B} = \sigma(A - B)$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
Starting with the left-hand side:
$$\frac{e^A}{e^A + e^B}$$
we can rewrite it as:
$$= \frac{e^A / e^A}{(e^A + e^B)/e^A} = \frac{1}{1 + e^{B}/e^{A}} = \frac{1}{1 + e^{B-A}}$$
which is the sigmoid function of $(A - B)$:
$$= \frac{1}{1 + e^{-(A-B)}} = \sigma(A - B)$$
Therefore, the Bradley-Terry model can be written as:
$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) \tag{II.II}$$
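As a quick numeric sanity check of this identity (my own illustration, using arbitrary reward values):

```python
import torch

A = torch.tensor(1.7)   # r(x, y_w), arbitrary
B = torch.tensor(-0.4)  # r(x, y_l), arbitrary

lhs = torch.exp(A) / (torch.exp(A) + torch.exp(B))  # Bradley-Terry form
rhs = torch.sigmoid(A - B)                          # sigmoid of the reward difference
print(torch.allclose(lhs, rhs))  # True
```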
Reward Model Loss
Given a dataset of preferences $\mathcal{D}$, we can train a parameterized reward model $r_\phi(x,y)$ using maximum likelihood estimation. We want to maximize the probability of observing the preferences in our dataset:
$$\max_{\phi} \prod_{(x, y_w, y_l) \in \mathcal{D}} P(y_w \succ y_l \mid x)$$
Taking the log and negating (to turn maximization into minimization), we get the negative log-likelihood loss:
$$\mathcal{L}_R(r_\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big] \tag{II.III}$$
This is just a binary cross-entropy objective that teaches the reward model to assign higher rewards to preferred responses. Notice that the Bradley-Terry model depends only on the difference of rewards: $r(x, y_w) - r(x, y_l)$. The absolute values don't matter, only their relative ordering. This means:
If we add to the reward any constant $c$, or any function $f(x)$ that depends only on the prompt (not the response), the preference probabilities don't change. This invariance property will be the key to deriving DPO.
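As a concrete illustration, here is a minimal PyTorch sketch of this pairwise loss. It assumes the scalar rewards $r_\phi(x, y_w)$ and $r_\phi(x, y_l)$ have already been produced by some reward model; the function name is mine, not from the post:

```python
import torch.nn.functional as F

def bradley_terry_loss(rewards_chosen, rewards_rejected):
    """Negative log-likelihood of the Bradley-Terry model (eq. II.III).

    rewards_chosen / rewards_rejected: tensors of shape (batch,) holding
    r_phi(x, y_w) and r_phi(x, y_l) for each preference pair.
    """
    # -log sigma(r_w - r_l): pushes the reward of the winner above the loser
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
```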
III: Optimal Policy in Closed Form
Here, we will find the exact analytical solution to this optimization problem, i.e. the optimal policy. We want to find the policy that maximizes expected reward while keeping the KL divergence from the reference policy bounded:
$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(y|x)}\big[r(x,y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi(y|x) \,\|\, \pi_{\mathrm{ref}}(y|x)\big]$$
For a fixed prompt x, we want to find the distribution π(⋅∣x) that maximizes:
$$\mathbb{E}_{y \sim \pi(y|x)}\left[r(x,y) - \beta \log \frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right]$$
This is a constrained optimization problem over probability distributions. We can solve it using the method of Lagrange multipliers, enforcing that $\pi(y|x)$ sums to 1. For discrete $y$, the Lagrangian is:
$$\mathcal{L} = \sum_{y} \pi(y|x)\left[r(x,y) - \beta \log \frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right] - \lambda\left(\sum_{y} \pi(y|x) - 1\right)$$
Taking the derivative with respect to $\pi(y|x)$ and setting it to zero (stationary point):
$$\frac{\partial \mathcal{L}}{\partial \pi(y|x)} = r(x,y) - \beta \log \frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} - \beta - \lambda = 0$$
Now, solving for $\pi(y|x)$:
$$\log \frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} = \frac{1}{\beta}\big(r(x,y) - \beta - \lambda\big)$$
$$\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} = \exp\left(\frac{1}{\beta} r(x,y)\right) \cdot \exp\left(-1 - \frac{\lambda}{\beta}\right)$$
$$\pi(y|x) = \pi_{\mathrm{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta} r(x,y)\right) \cdot \exp\left(-1 - \frac{\lambda}{\beta}\right)$$
The term $\exp(-1 - \lambda/\beta)$ is a constant (with respect to $y$) that ensures normalization. To find its value, we enforce that $\pi(y|x)$ must be a valid probability distribution and sum to 1:
$$\sum_{y} \pi(y|x) = 1$$
Substituting our expression for $\pi(y|x)$:
$$\sum_{y} \pi_{\mathrm{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta} r(x,y)\right) \cdot \exp\left(-1 - \frac{\lambda}{\beta}\right) = 1$$
Since $\exp(-1 - \lambda/\beta)$ doesn't depend on $y$, we can factor it out of the sum:
$$\exp\left(-1 - \frac{\lambda}{\beta}\right) \cdot \sum_{y} \pi_{\mathrm{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta} r(x,y)\right) = 1$$
Solving for the constant:
$$\exp\left(-1 - \frac{\lambda}{\beta}\right) = \frac{1}{\sum_{y} \pi_{\mathrm{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta} r(x,y)\right)}$$
We define this normalizing sum as the partition function $Z(x)$:
$$Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right) \tag{III.II}$$
Substituting back, we get the optimal policy:
$$\pi_r(y|x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right) \tag{III.III}$$
We have an exact closed-form expression for the optimal policy. However, we cannot compute it directly because $Z(x)$ is intractable: evaluating it requires summing over all possible responses $y$, which is not feasible.
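To see equation (III.III) in action, here is a toy sketch of my own (not from the post) in which the response space has only four candidates, so $Z(x)$ can actually be computed:

```python
import torch

beta = 0.1
pi_ref = torch.tensor([0.40, 0.30, 0.20, 0.10])  # reference probabilities over 4 candidate responses
reward = torch.tensor([1.0, 2.0, 0.5, 3.0])      # r(x, y) for each candidate

# Unnormalized optimal policy: pi_ref(y|x) * exp(r(x, y) / beta)
unnormalized = pi_ref * torch.exp(reward / beta)

# Partition function Z(x): only tractable here because the response space is tiny
Z = unnormalized.sum()
pi_star = unnormalized / Z

print(pi_star)  # probability mass concentrates on the highest-reward response for small beta
```

For a real LLM the sum would run over every possible token sequence, which is exactly why $Z(x)$ is intractable.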
IV: The Reparameterization Trick
The key insight of DPO is to flip the relationship between reward and policy. Now we frame the problem as: "given an optimal policy, what reward function does it correspond to?"
Starting from the optimal policy equation (III.III):
$$\pi_r(y|x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right)$$
We solve for the reward r(x,y) by first taking the log of both sides:
$$\log \pi_r(y|x) = \log \pi_{\mathrm{ref}}(y|x) + \frac{1}{\beta} r(x,y) - \log Z(x)$$
Now rearrange to get $r(x,y)$ on the left-hand side:
$$\frac{1}{\beta} r(x,y) = \log \pi_r(y|x) - \log \pi_{\mathrm{ref}}(y|x) + \log Z(x)$$
$$r(x,y) = \beta \log \pi_r(y|x) - \beta \log \pi_{\mathrm{ref}}(y|x) + \beta \log Z(x)$$
This can be written more compactly as:
$$r(x,y) = \beta \log \frac{\pi_r(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x) \tag{IV.I}$$
The reward is expressed as:
A term involving the log-ratio of the optimal policy to the reference policy
$\beta \log Z(x)$, which depends only on $x$ (not on $y$)
V: Deriving the DPO Loss
Finally, we have all the pieces to derive the DPO loss. From Section II, the Bradley-Terry preference model is:
$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$
From Section IV, assuming we have access to the optimal policy $\pi^*$, the reward can be written as:
$$r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x)$$
Now recall the critical insight from Section II: the Bradley-Terry model depends only on reward differences. Substituting the expression above into the difference $r(x, y_w) - r(x, y_l)$:
$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)} - \beta \log Z(x)$$
The intractable partition function $Z(x)$ cancels out. This is what makes DPO possible.
We can write this more cleanly by defining the implicit reward in terms of the optimal policy:
$$\hat{r}(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)}$$
Thus:
$$P(y_w \succ y_l \mid x) = \sigma\big(\hat{r}(x, y_w) - \hat{r}(x, y_l)\big)$$
We don't actually have access to the optimal policy $\pi^*$. But we can parameterize a policy $\pi_\theta$ and optimize it to maximize the likelihood of the observed preferences. This is exactly what the reward model loss (II.III) does, except now our reward is implicitly defined by the policy itself. Plugging the implicit reward into (II.III) gives the DPO loss:
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)\right]$$
The policy implicitly defines its own reward via the log-ratio with the reference. There is no separate reward model.
This is just a supervised classification loss on preference pairs with no RL.
DPO uses the fixed preference dataset D. Thus, no sampling during training.
No value function needed since we're not doing policy gradients
DPO still optimizes the KL-constrained reward maximization objective but in a different way
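To make the objective concrete, here is a minimal PyTorch sketch of the DPO loss. It assumes the four sequence-level log probabilities (policy and reference, for the chosen and rejected responses) have already been computed, as described in Section VII; the function name and the default $\beta$ are illustrative choices of mine:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss from sequence-level log probabilities, each of shape (batch,)."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    chosen_rewards = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (policy_logp_rejected - ref_logp_rejected)

    # -log sigma(r_hat_w - r_hat_l), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```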
VI: Building Intuition for DPO
Now that we have the DPO loss, we can build some intuition around the implicit reward model and its gradient updates.
Implicit Reward Model
The DPO paper's subtitle is "Your Language Model is Secretly a Reward Model", and it captures the key insight: the LLM itself serves as the implicit reward model. The policy $\pi_\theta$ defines an implicit reward function:
$$\hat{r}_\theta(x,y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)}$$
This reward measures how much more likely the current policy is to generate response y compared to the reference policy, scaled by β.
If $\pi_\theta(y|x) > \pi_{\mathrm{ref}}(y|x)$: the implicit reward is positive (the policy "likes" this response more than the reference)
If $\pi_\theta(y|x) < \pi_{\mathrm{ref}}(y|x)$: the implicit reward is negative (the policy "likes" this response less than the reference)
Analyzing the Gradient Update
We can flex our brain muscles one more time and compute the gradient for the DPO objective:
$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\big[\log \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\big)\big]$$
Let $u = \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)$. Using the chain rule (and noting that $\pi_{\mathrm{ref}}$ is frozen):
$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\Big[\sigma(-u)\,\beta\big(\nabla_\theta \log \pi_\theta(y_w|x) - \nabla_\theta \log \pi_\theta(y_l|x)\big)\Big]$$
Each piece has a clear interpretation:
$\nabla_\theta \log \pi_\theta(y_w|x)$ points in the direction that increases the probability of the preferred response
$-\nabla_\theta \log \pi_\theta(y_l|x)$ points in the direction that decreases the probability of the dispreferred response
The weight term $\sigma(-u) = \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$ is high when the model currently assigns a higher implicit reward to the losing response than to the winning response. In other words:
When the model is wrong (ranks yl above yw), we get large gradient updates
When the model is right (ranks yw above yl), we get small gradient updates
This dynamic sigmoid weighting is crucial. It naturally focuses learning on the examples the model currently gets wrong.
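A tiny illustration of this weighting, with hypothetical margin values of my own:

```python
import torch

# u = r_hat(x, y_w) - r_hat(x, y_l): the implicit reward margin
u_model_wrong = torch.tensor(-2.0)  # model currently prefers the losing response
u_model_right = torch.tensor(2.0)   # model already prefers the winning response

print(torch.sigmoid(-u_model_wrong))  # ~0.88 -> large gradient weight
print(torch.sigmoid(-u_model_right))  # ~0.12 -> small gradient weight
```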
VII: Computing Log Probabilities in Practice
This section is fully adapted from Umar Jamil's video. I think it is essential to understand how log probabilities are computed in practice.
The DPO loss requires computing $\log \pi_\theta(y|x)$, the log probability of a complete response $y$ given a prompt $x$. Let's see how this works in practice with LLMs.
Language models are autoregressive: they generate text one token at a time, conditioning on all previous tokens. For a response $y = (y_1, y_2, \ldots, y_T)$, the probability factorizes as:
$$\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t})$$
The log probability of the full response is the sum of log probabilities at each position.
Here's how to compute $\log \pi_\theta(y|x)$:
Prepare input: Concatenate the prompt and response into a single sequence
$$\text{input} = [x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_T]$$
Forward pass: Run the transformer to get hidden states at each position
Project to logits: Apply the language model head (typically a linear layer) to get vocabulary logits at each position
Log softmax: Convert logits to log probabilities over the vocabulary using logsoftmax
Gather relevant log probs: For each position $t$ in the response, extract the log probability of the actual next token $y_t$ (since we know the output)
Sum with masking: Sum the log probabilities but only for response tokens (not prompt tokens)
$$\log \pi_\theta(y \mid x) = \sum_{t \,\in\, \text{response positions}} \log \pi_\theta(y_t \mid x, y_{<t})$$
This gives us $\log \pi_\theta(y|x)$ for one response. We do this for both the preferred response $y_w$ and the dispreferred response $y_l$, and we also do it for both the policy model $\pi_\theta$ and the frozen reference model $\pi_{\mathrm{ref}}$. With these four log probabilities in hand, we can compute the DPO loss.
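Here is a sketch of this procedure in PyTorch. It assumes a Hugging Face-style causal LM whose output exposes `.logits` of shape `(batch, seq_len, vocab)`; the argument names and the `response_mask` convention are my own, not from the video:

```python
import torch.nn.functional as F

def sequence_logprob(model, input_ids, attention_mask, response_mask):
    """Sum of log-probabilities of the response tokens under `model`.

    input_ids:     (batch, seq_len) prompt tokens followed by response tokens
    response_mask: (batch, seq_len) 1.0 for response tokens, 0.0 for prompt/padding
    """
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits

    # The token at position t is predicted by the logits at position t-1, so shift by one
    logits = logits[:, :-1, :]
    labels = input_ids[:, 1:]
    mask = response_mask[:, 1:]

    # Log-softmax over the vocabulary, then gather the log-prob of each actual token
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)

    # Keep only response positions and sum per sequence
    return (token_logp * mask).sum(dim=-1)
```

Calling this four times (policy and reference model, chosen and rejected responses, with `torch.no_grad()` around the reference passes) produces exactly the inputs to the DPO loss sketched in Section V.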
Conclusion
Once you derive the DPO loss, you start appreciating the simplicity and elegance of the solution, especially when compared to PPO. The derivation hinges on one observation: the Bradley-Terry model only cares about reward differences, which causes the intractable partition function in the analytical solution to cancel out completely. What remains is a straightforward classification loss.