
Commit 482d058

add in grpo.md
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
1 parent 9270595 commit 482d058

1 file changed: docs/guides/grpo.md (14 additions & 0 deletions)

@@ -126,6 +126,20 @@ where:
- $\beta$ is the KL penalty coefficient
- $\pi_{\text{ref}}$ is the reference policy

Also supports "Dual-Clipping" from https://arxiv.org/pdf/1912.09729, which imposes an additional upper bound on the probability ratio when advantages are negative. This prevents excessive policy updates: when $r_t(\theta) A_t \ll 0$, the term is clipped to $c A_t$. The loss function is modified to the following when $A_t < 0$:

$$
L(\theta) = E_t \Big[ \max \Big( \min \big(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) A_t \big), c A_t \Big) \Big] - \beta D_{\text{KL}} (\pi_\theta \| \pi_\text{ref})
$$

where:
- $c$ is the dual-clip parameter (`ratio_clip_c`), which must be greater than 1 and is usually set to 3 empirically
- $r_t(\theta)$ is the ratio $\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}$ that measures how much the policy has changed
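
The following is a minimal PyTorch sketch of this dual-clipped surrogate, not the repository's implementation; the function name, the `eps` argument, and the flat tensor shapes are illustrative assumptions, and the KL penalty term is omitted:

```python
import torch

def dual_clip_loss(logprobs: torch.Tensor,
                   old_logprobs: torch.Tensor,
                   advantages: torch.Tensor,
                   eps: float = 0.2,
                   ratio_clip_c: float = 3.0) -> torch.Tensor:
    """Illustrative per-token dual-clipped policy loss (KL penalty omitted)."""
    assert ratio_clip_c > 1.0, "dual-clip parameter c must be > 1"
    # r_t(theta) = pi_theta(x) / pi_theta_old(x), computed from log-probs.
    ratio = torch.exp(logprobs - old_logprobs)
    # Standard PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages,
    )
    # Dual clip: for A_t < 0, bound the surrogate below by c*A_t so that a
    # very large ratio cannot produce an arbitrarily large update.
    dual_clipped = torch.maximum(surrogate, ratio_clip_c * advantages)
    objective = torch.where(advantages < 0, dual_clipped, surrogate)
    # The objective is maximized, so the loss is its negative mean.
    return -objective.mean()
```

As a numeric check: with $A_t = -1$, $r_t(\theta) = 10$, and $\varepsilon = 0.2$, the standard surrogate evaluates to $\min(-10, -1.2) = -10$, whereas dual clipping bounds it at $c A_t = -3$, which is exactly the excessive update the extra clip prevents.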
#### Improvements to the GRPO loss formulation for stability and accuracy

#### On-Policy KL Approximation
