DPO
This blog post tries to understand how DPO (Direct Preference Optimization) works.
RLHF and the Bradley-Terry model
The key insight: the RL-based objective used by existing RLHF methods can be optimized exactly with a simple binary cross-entropy objective.
At a high level, the objective of RLHF is to maximize the reward without drifting too far from the original model (that’s why we need a reference model).
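Written out, this is the standard KL-constrained objective from the RLHF/DPO literature, where $r_{\phi}$ is the learned reward model and $\pi_{\text{ref}}$ is the reference (SFT) model:

$$\max_{\pi_{\theta}} \; \mathbb{E}_{x \sim D,\, y \sim \pi_{\theta}(y \mid x)}\big[r_{\phi}(x, y)\big] - \beta\, \mathbb{D}_{\text{KL}}\big[\pi_{\theta}(y \mid x)\,\|\,\pi_{\text{ref}}(y \mid x)\big]$$

Here $\beta$ controls how far the policy is allowed to drift from the reference model.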
Sampling from the LM policy during training and k-step bootstrapping are the root cause of why RLHF is unstable and so expensive.
DPO uses a change of variables to define the preference loss as a function of the policy.
Bandit vs. RL settings: DPO works as contextual bandit learning, using preferences or rankings of actions (completions) rather than scalar rewards.
How does DPO work
The RL approach needs an explicit reward model. DPO instead uses a change of variables: the analytical mapping from reward functions to optimal policies lets us transform a loss function over reward functions into a loss function over policies.
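Concretely, the change of variables from the DPO paper writes the reward in terms of the policy that is optimal for it:

$$r(x, y) = \beta \log \frac{\pi_{r}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

where $\pi_{r}$ is the optimal policy for reward $r$ and $Z(x)$ is the partition function discussed next.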
DPO objective
- Rewrite the RL objective using the partition function.
For a probabilistic model, the partition function $Z$ is the sum (or integral, for continuous variables) over all possible states of the system of the exponential of the negative energy (or cost) of each state, divided by a temperature parameter (in machine learning this temperature is often set to 1, so it disappears from the equations).
$$Z = \sum_{x} e^{-\frac{E(x)}{T}}$$
where:
- $x$ represents a possible state of the system.
- $E(x)$ is the energy (or cost) function associated with state $x$.
- $T$ is the temperature parameter, controlling the distribution’s sharpness.
In the DPO derivation, the partition function $Z(x)$ does not depend on the policy $\pi$ being optimized; it depends only on the prompt $x$ and the reference policy $\pi_{\text{ref}}$.
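Concretely, the KL-constrained RL objective has a closed-form optimal policy (from the DPO paper), and the partition function that normalizes it is:

$$\pi_{r}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\!\Big(\frac{1}{\beta} r(x, y)\Big), \qquad Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \exp\!\Big(\frac{1}{\beta} r(x, y)\Big)$$

This makes it explicit that $Z(x)$ involves only the prompt, the reward, and $\pi_{\text{ref}}$, not the policy being trained. $Z(x)$ is still intractable to compute (it sums over all possible completions), which is why we want it to cancel out.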
- Get rid of the partition function using the Bradley-Terry model
Given two items, $A$ and $B$, with parameters $p_A$ and $p_B$ representing their respective strengths or abilities, the Bradley-Terry model predicts the probability that $A$ is preferred over $B$ as follows:
$$P(A > B) = \frac{p_A}{p_A + p_B}$$
If we plug the reparameterized reward into the Bradley-Terry model, the partition function cancels and the preference probability can be expressed purely in terms of the policy, which removes the explicit reward model from the objective (see below).
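Concretely, substituting the reparameterized reward into the Bradley-Terry model, the $\beta \log Z(x)$ terms cancel and we get (these are the formulas from the DPO paper):

$$p(y_w \succ y_l \mid x) = \sigma\!\Big(\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)$$

Maximizing the log-likelihood of the observed preferences then gives the DPO loss, a binary cross-entropy over policy log-ratios:

$$L_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\Big[\log \sigma\!\Big(\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)\Big]$$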
- Final gradient of the loss function
$$\nabla_{\theta} L_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim D}\Big[\underbrace{\sigma\big(\hat{r}_{\theta}(x, y_l) - \hat{r}_{\theta}(x, y_w)\big)}_{\text{higher weight when reward estimate is wrong}}\big[\underbrace{\nabla_{\theta} \log \pi(y_w \mid x)}_{\text{increase likelihood of } y_w} - \underbrace{\nabla_{\theta} \log \pi(y_l \mid x)}_{\text{decrease likelihood of } y_l}\big]\Big]$$
Note that the gradient increases the likelihood of the winning response $y_w$ and decreases the likelihood of the losing response $y_l$; and when the implicit reward estimate is wrong (the model prefers $y_l$ over $y_w$), the example gets a higher weight. Here $\hat{r}_{\theta}(x, y) = \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the implicit reward defined by the policy and the reference model.
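A minimal PyTorch sketch of this loss, assuming we already have per-sequence log-probabilities (sums of token log-probs) for the chosen and rejected completions under both the policy and the frozen reference model; the function and argument names here are my own, not from the paper:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
             policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
             ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
             ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: r_hat(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary cross-entropy form: -log sigma(r_hat(x, y_w) - r_hat(x, y_l))
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```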
How to do DPO
- Sample completions for every prompt and label them with human preferences.
- Optimize the LLM to minimize the DPO loss (see the training-step sketch below).
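Putting the two steps together, a hypothetical training step might look like the sketch below; `policy`, `ref_model`, `compute_logps`, and `preference_dataloader` are placeholders that depend on your model and tokenizer setup, not a real API:

```python
import torch

optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

for batch in preference_dataloader:  # each batch holds prompt x, chosen y_w, rejected y_l
    # The reference model is frozen: no gradients flow through it.
    with torch.no_grad():
        ref_chosen = compute_logps(ref_model, batch["prompt"], batch["chosen"])
        ref_rejected = compute_logps(ref_model, batch["prompt"], batch["rejected"])

    pol_chosen = compute_logps(policy, batch["prompt"], batch["chosen"])
    pol_rejected = compute_logps(policy, batch["prompt"], batch["rejected"])

    loss = dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```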