Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for fine-tuning large language models (LLMs) on reasoning tasks. This guide explains the GRPO process with diagrams and step-by-step explanations.
The core GRPO process is depicted as a circular workflow with five key stages:
flowchart TD
A[1. Generate multiple answers\nfor each question] -->|Forward pass| B[2. Score each answer\nusing a reward model]
B --> C[3. Calculate the average score\nfor the group of answers]
C --> D[4. Compare each score to the average\nto determine advantage]
D --> E[5. Update model to favor higher advantages]
E -->|Next iteration| A
subgraph "Key Benefits"
F[• Memory Efficient\n• Simpler Implementation\n• More Stable Training\n• Better Scalability]
end
The process begins by generating diverse answers for the same question using temperature-based sampling:
flowchart LR
A[Input Question] --> B[Policy Model πθ]
B --> C{Sampling with Temperature}
C --> D[Answer 1]
C --> E[Answer 2]
C --> F[Answer 3]
C --> G[Answer N]
subgraph "Group G of Answers"
D
E
F
G
end
The policy model receives a question and generates a diverse set of answers (typically 4–8 per question) to build a robust group of responses.
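As a minimal sketch of this step, assuming a Hugging Face causal LM and tokenizer (the checkpoint name below is a placeholder, and the sampling settings are illustrative), a group of answers can be generated like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the policy model you are training.
MODEL_NAME = "your-policy-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def sample_group(question: str, group_size: int = 8, temperature: float = 0.8) -> list[str]:
    """Sample a group of diverse answers for one question via temperature sampling."""
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,                    # stochastic sampling instead of greedy decoding
            temperature=temperature,           # >0 spreads probability mass, diversifying the group
            num_return_sequences=group_size,   # G answers for the same question
            max_new_tokens=256,
        )
    # Drop the prompt tokens and decode only the generated continuations.
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)

group = sample_group("What is 17 * 24? Explain your reasoning.", group_size=4)
```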
Each answer is scored using a reward model, and the algorithm calculates group statistics:
flowchart LR
A[Group of Answers] --> B[Reward Model Rϕ]
B --> C[Score for Answer 1]
B --> D[Score for Answer 2]
B --> E[Score for Answer 3]
B --> F[Score for Answer N]
%% Group Statistics and Mathematical Representation
subgraph "Mathematical Representation"
G["mean(G) = (1/N) × Σ R(ri)"]
H["std(G) = √[(1/N) × Σ (R(ri) - mean(G))²]"]
end
The reward model assigns a score to each answer. Then, the mean and standard deviation of all scores in the group are calculated to establish a performance baseline.
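A small sketch of this scoring step, where `reward_model` is a hypothetical callable that maps an answer string to a scalar reward (in practice it might be a learned reward model or a rule-based verifier):

```python
import numpy as np

def group_statistics(answers, reward_model):
    """Score every answer in the group and compute the group baseline."""
    # R(r_i) for each answer r_i in the group G
    rewards = np.array([reward_model(a) for a in answers], dtype=np.float64)
    group_mean = rewards.mean()   # mean(G)
    group_std = rewards.std()     # std(G)
    return rewards, group_mean, group_std
```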
Instead of using a separate value network (as in PPO), GRPO compares each answer’s score to the group average to determine its "advantage":
flowchart LR
A[Individual Score] --> B{Compare to Group Average}
C[Group Mean] --> B
D[Group Std Dev] --> B
B --> E[Advantage Calculation]
subgraph "Advantage Formula"
F["Ai = (R(ri) - mean(G))/std(G)"]
end
This normalization process creates a clear learning signal by highlighting which answers performed above or below average.
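In code, the normalization is a one-liner; the small `eps` term is an added safeguard (not part of the formula above) against a zero standard deviation when every answer receives the same reward:

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """A_i = (R(r_i) - mean(G)) / std(G), computed over one group of rewards."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Answers scoring above the group mean get positive advantages, below-average ones negative.
advantages = group_advantages(np.array([0.2, 0.9, 0.4, 0.9]))
```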
Using the computed advantages, the model is updated to reinforce effective reasoning patterns while maintaining stability:
flowchart TD
A[Advantages] --> B[GRPO Objective Function]
C[Policy Model πθ] --> B
D[Reference Model πref] --> B
B --> E[Gradient Update]
E --> F[Updated Policy Model]
subgraph "GRPO Objective"
G["JGRPO(θ) = min(ratio × advantage, clipped_ratio × advantage) - β × KL_divergence"]
end
The GRPO objective updates the policy model to favor patterns that lead to higher advantages. The KL divergence term prevents the model from straying too far from a reference policy.
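A simplified PyTorch sketch of this objective, operating on per-sequence log-probabilities for one group; it mirrors the structure shown in the diagram rather than reproducing any exact published implementation, and the KL term uses a common low-variance estimator:

```python
import torch

def grpo_loss(logprobs, old_logprobs, ref_logprobs, advantages,
              clip_eps: float = 0.2, kl_beta: float = 0.001):
    """Clipped surrogate objective with a KL penalty toward the reference policy.

    All inputs are 1-D tensors of shape (group_size,): summed log-probabilities of
    each answer under the current, old, and reference policies, plus its advantage.
    """
    ratio = torch.exp(logprobs - old_logprobs)                   # pi_theta / pi_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)

    # Estimate KL(pi_theta || pi_ref) per answer; always non-negative.
    log_diff = ref_logprobs - logprobs
    kl = torch.exp(log_diff) - log_diff - 1.0

    # Negate because optimizers minimize; the objective itself is maximized.
    return -(surrogate - kl_beta * kl).mean()
```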
A side-by-side comparison shows the fundamental differences between PPO and GRPO:
flowchart TD
subgraph "PPO Architecture"
A1[Policy Model] --> B1[Generate Answer]
B1 --> C1[Reward Model]
C1 --> D1[Value Network]
D1 --> E1[Advantage Estimation]
E1 --> F1[Policy Update]
end
subgraph "GRPO Architecture"
A2[Policy Model] --> B2[Generate Multiple Answers]
B2 --> C2[Reward Model]
C2 --> D2[Group Statistics]
D2 --> E2[Advantage Calculation]
E2 --> F2[Policy Update]
end
G[Main Difference] --> H[GRPO eliminates Value Network]
G --> I[GRPO uses group statistics as baseline]
The key innovation in GRPO is the elimination of a separate value network. Instead, GRPO relies on group statistics to compute advantages, resulting in a more memory-efficient and simpler implementation.
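To make the contrast concrete, the sketch below shows where each algorithm's baseline comes from; both functions are illustrative simplifications, not library APIs:

```python
import numpy as np

# PPO: the baseline is a learned value estimate V(s) from a separate network.
def ppo_advantage(reward: float, value_estimate: float) -> float:
    # Simplified one-step view; real PPO typically uses GAE over token-level values.
    return reward - value_estimate

# GRPO: the baseline is the statistics of the other answers in the same group.
def grpo_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```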
The overall training pipeline alternates between supervised fine-tuning (SFT) and GRPO phases:
flowchart TD
A[Pre-trained LLM] --> B[Stage 1: SFT]
B --> C[Stage 2: GRPO Training]
C --> D[Stage 3: SFT with Synthetic Data]
D --> E[Stage 4: GRPO Alignment]
E --> F[Final Reasoning-Enhanced LLM]
subgraph "SFT Data"
G[High-quality Expert Demonstrations]
end
subgraph "GRPO Components"
H[Policy Model]
I[Reward Functions]
J[Group Advantage Estimation]
end
subgraph "Synthetic Data Generation"
K[Generate Examples]
L[LLM-as-Judge Filtering]
M[Quality Control]
end
The pipeline includes initial SFT on expert data, followed by GRPO training to enhance reasoning, synthetic data SFT for expansion, and a final GRPO alignment phase for improved helpfulness and safety.
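At a high level, the pipeline could be sketched as the loop below; every function name here (`sft`, `grpo_train`, `generate_examples`, `filter_with_llm_judge`) is a hypothetical placeholder standing in for the corresponding phase:

```python
def train_reasoning_model(base_model, expert_data, prompts, reward_fns):
    # Stage 1: supervised fine-tuning on high-quality expert demonstrations.
    model = sft(base_model, expert_data)
    # Stage 2: GRPO training to strengthen reasoning.
    model = grpo_train(model, prompts, reward_fns)
    # Stage 3: SFT on synthetic data filtered by an LLM judge.
    synthetic = filter_with_llm_judge(generate_examples(model))
    model = sft(model, synthetic)
    # Stage 4: final GRPO alignment pass for helpfulness and safety.
    model = grpo_train(model, prompts, reward_fns)
    return model
```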
| Parameter | Description | Recommended Range | Effect |
|---|---|---|---|
| Group Size (G) | Number of answers per question | 4–8 | Higher → Better baseline estimate |
| KL Weight (β) | Controls policy drift | 0.0001–0.001 | Higher → Less policy drift |
| Clipping (ε) | Limits the size of policy updates | 0.1–0.3 | Lower → More conservative updates |
| Learning Rate | Step size for updates | 1e-5 to 1e-6 | Lower → More stable training |
Careful tuning of these hyperparameters helps optimize GRPO for various model sizes and reasoning tasks.
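One way to keep these settings in a single place is a small config object; the defaults below are illustrative values taken from the ranges in the table above, not prescriptions:

```python
from dataclasses import dataclass

@dataclass
class GRPOConfig:
    group_size: int = 8          # answers sampled per question (G)
    kl_beta: float = 0.001       # weight on the KL penalty (β)
    clip_eps: float = 0.2        # clipping range for the policy ratio (ε)
    learning_rate: float = 1e-6  # optimizer step size
```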
GRPO represents a significant advancement in reinforcement learning for LLMs:
- Elimination of the Value Network: Simplifies the architecture and reduces memory overhead.
- Group-based Advantage Estimation: Leverages the collective performance of multiple answers for dynamic baseline estimation.
- Enhanced Scalability: Makes training more efficient even on consumer hardware.
This approach democratizes advanced RL training, enabling researchers and developers to build specialized reasoning models with fewer resources.