Draft:Group Relative Policy Optimization

From Wikipedia, the free encyclopedia

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm introduced by researchers at DeepSeek in 2024.[1] The algorithm modifies the widely used Proximal Policy Optimization (PPO) approach by eliminating the critic network and instead computing advantage estimates from reward statistics within each group of outputs sampled for the same prompt.

Method


Traditional PPO implementations use an actor-critic architecture with separate policy and value networks. GRPO removes the value network entirely, reducing computational overhead and memory requirements during training.[1]

For a given prompt, GRPO samples a group of outputs and computes advantages by comparing each output's reward to the group statistics. The advantage of the <math>i</math>-th output is

<math>\hat{A}_i = \frac{r_i - \mu}{\sigma}</math>

where <math>\mu</math> and <math>\sigma</math> are the mean and standard deviation of the rewards within the sampled group. This normalization ensures that advantages are computed relative to the current group rather than requiring a separate value function approximation.
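The group-relative advantage computation can be illustrated with a short sketch (Python, for illustration only; function and variable names are not taken from the original paper):

<syntaxhighlight lang="python">
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of scalar rewards to zero mean and unit variance."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # epsilon guards against division by zero when all rewards in the group are equal
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: eight sampled answers to one prompt, scored 1 if correct and 0 otherwise
print(group_relative_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
</syntaxhighlight>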

The policy update uses a clipped objective similar to PPO:

<math>J_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\rho_i \hat{A}_i,\ \operatorname{clip}\left(\rho_i,\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_i\right) - \beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\right)\right]</math>

where <math>\rho_i = \pi_\theta(o_i \mid q)/\pi_{\theta_\text{old}}(o_i \mid q)</math> represents the probability ratio between the current and old policies, and the KL divergence term prevents excessive policy changes.
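A simplified sketch of the resulting loss for one group is given below (Python with PyTorch; sequence-level log-probabilities and a common unbiased KL estimator are assumed for brevity, whereas the published formulation applies the objective per token):

<syntaxhighlight lang="python">
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, kl_coef=0.04):
    """Clipped surrogate loss with a KL penalty toward a reference policy.

    Each argument is a 1-D tensor with one entry per sampled output in the group.
    """
    ratio = torch.exp(logp_new - logp_old)                      # probability ratio rho_i
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()            # clipped policy objective
    # Unbiased "k3" estimator of KL(pi_theta || pi_ref), commonly used in RLHF code
    log_ratio = logp_ref - logp_new
    kl = (torch.exp(log_ratio) - log_ratio - 1).mean()
    return -(surrogate - kl_coef * kl)                          # negate: optimizers minimize
</syntaxhighlight>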

Applications


GRPO was first applied to train mathematical reasoning models, including the DeepSeekMath 7B model.[1] The algorithm has since been used in training the DeepSeek-R1 series, which demonstrated improved performance on reasoning benchmarks.[2]

Several machine learning frameworks have incorporated GRPO implementations, including the Hugging Face Transformer Reinforcement Learning (TRL) library and Unsloth's fine-tuning toolkit.
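As an illustration, a GRPO fine-tuning run with TRL's GRPOTrainer can be configured roughly as follows (a minimal sketch: the model identifier and toy reward function are placeholders, and argument names may differ between TRL releases):

<syntaxhighlight lang="python">
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward that prefers shorter completions; real setups use task-specific
# rewards, such as answer correctness for mathematical reasoning.
def reward_len(completions, **kwargs):
    return [-float(len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt dataset works

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",                  # placeholder model identifier
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-demo", num_generations=8),
    train_dataset=dataset,
)
trainer.train()
</syntaxhighlight>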


References

  1. ^ a b c Shao, Zhihong; Wang, Peiyi; Zhu, Qihao; et al. (2024-02-05). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". arXiv:2402.03300 [cs.CL].
  2. ^ DeepSeek-AI; et al. (2025-01-22). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". arXiv:2501.12948 [cs.CL].