
Group Relative Policy Optimization (GRPO): A Comprehensive Guide

Group Relative Policy Optimization (GRPO) is an innovative reinforcement learning algorithm aimed at enhancing large language models (LLMs) for reasoning tasks. This guide explains the GRPO process with detailed diagrams and step-by-step explanations.


Main GRPO Workflow

The core GRPO process is depicted as a circular workflow with five key stages:

```mermaid
flowchart TD
    A[1. Generate multiple answers\nfor each question] -->|Forward pass| B[2. Score each answer\nusing a reward model]
    B --> C[3. Calculate the average score\nfor the group of answers]
    C --> D[4. Compare each score to the average\nto determine advantage]
    D --> E[5. Update model to favor higher advantages]
    E -->|Next iteration| A
    
    subgraph "Key Benefits"
    F[• Memory Efficient\n• Simpler Implementation\n• More Stable Training\n• Better Scalability]
    end
```
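
Putting the five stages together, a compact sketch of a single GRPO iteration might look like the following. The callables passed in (`sample_answers`, `score_answers`, `compute_advantages`, `grpo_update`) are placeholders for the components detailed in the stage-by-stage sections below, not part of any particular library:

```python
# Compact sketch of one GRPO iteration; every helper is a placeholder callable
# supplied by the caller and detailed in the stages below.
def grpo_iteration(policy, questions, sample_answers, score_answers,
                   compute_advantages, grpo_update, group_size=8):
    for question in questions:
        answers = sample_answers(policy, question, n=group_size)   # Stage 1
        rewards = score_answers(question, answers)                 # Stage 2
        advantages = compute_advantages(rewards)                   # Stages 3-4
        grpo_update(policy, question, answers, advantages)         # Stage 5
```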

Stage 1: Generate Multiple Answers

The process begins by generating diverse answers for the same question using temperature-based sampling:

```mermaid
flowchart LR
    A[Input Question] --> B[Policy Model πθ]
    B --> C{Sampling with Temperature}
    C --> D[Answer 1]
    C --> E[Answer 2]
    C --> F[Answer 3]
    C --> G[Answer N]
    
    subgraph "Group G of Answers"
    D
    E
    F
    G
    end
```

The policy model receives a question and generates a diverse set of answers (typically 4–8 per question) to build a robust group of responses.
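
As an illustration, a minimal Stage 1 sketch using the Hugging Face transformers API might look like this; the model name is a placeholder, and the sampling settings are just reasonable defaults:

```python
# Sample a group of diverse answers for one question via temperature sampling.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-policy-model"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)

question = "What is 17 * 24?"
inputs = tokenizer(question, return_tensors="pt")

outputs = policy.generate(
    **inputs,
    do_sample=True,          # enable temperature-based sampling
    temperature=0.8,         # > 0 encourages diverse answers
    max_new_tokens=256,
    num_return_sequences=8,  # group size G
)
group = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```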


Stage 2: Score Each Answer

Each answer is scored using a reward model, and the algorithm calculates group statistics:

```mermaid
flowchart LR
    A[Group of Answers] --> B[Reward Model Rϕ]
    B --> C[Score for Answer 1]
    B --> D[Score for Answer 2]
    B --> E[Score for Answer 3]
    B --> F[Score for Answer N]
    
    %% Group Statistics and Mathematical Representation
    subgraph "Mathematical Representation"
    G["mean(G) = (1/N) × Σ R(ri)"]
    H["std(G) = √[(1/N) × Σ (R(ri) - mean(G))²]"]
    end
```

The reward model assigns a score to each answer. Then, the mean and standard deviation of all scores in the group are calculated to establish a performance baseline.
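
A minimal sketch of this scoring step, assuming `reward_model` is any callable that maps a (question, answer) pair to a scalar score:

```python
# Score every answer in the group and compute the baseline statistics.
# `reward_model` is a placeholder for whatever scorer is used
# (a learned reward model, a rule-based verifier, etc.).
import statistics

def score_group(reward_model, question, group):
    rewards = [reward_model(question, answer) for answer in group]
    mean_g = statistics.mean(rewards)    # mean(G)
    std_g = statistics.pstdev(rewards)   # std(G), population standard deviation
    return rewards, mean_g, std_g
```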


Stage 3: Advantage Estimation via Group Statistics

Instead of using a separate value network (as in PPO), GRPO compares each answer’s score to the group average to determine its "advantage":

```mermaid
flowchart LR
    A[Individual Score] --> B{Compare to Group Average}
    C[Group Mean] --> B
    D[Group Std Dev] --> B
    B --> E[Advantage Calculation]
    
    subgraph "Advantage Formula"
    F["Ai = (R(ri) - mean(G))/std(G)"]
    end
```

This normalization process creates a clear learning signal by highlighting which answers performed above or below average.
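
In code, the normalization is a one-liner per answer; the small epsilon below is an implementation detail (not part of the formula above) that guards against division by zero when every answer receives the same score:

```python
# Advantage of each answer relative to its group: A_i = (R(r_i) - mean(G)) / std(G)
def group_advantages(rewards, mean_g, std_g, eps=1e-8):
    return [(r - mean_g) / (std_g + eps) for r in rewards]

# Example: rewards [1.0, 0.0, 1.0, 0.0] have mean 0.5 and std 0.5,
# giving advantages of roughly [1.0, -1.0, 1.0, -1.0].
```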


Stage 4: Update Model Parameters

Using the computed advantages, the model is updated to reinforce effective reasoning patterns while maintaining stability:

```mermaid
flowchart TD
    A[Advantages] --> B[GRPO Objective Function]
    C[Policy Model πθ] --> B
    D[Reference Model πref] --> B
    B --> E[Gradient Update]
    E --> F[Updated Policy Model]
    
    subgraph "GRPO Objective"
    G["JGRPO(θ) = min(ratio × advantage, clipped_ratio × advantage) - β × KL_divergence"]
    end
```

The GRPO objective updates the policy model to favor patterns that lead to higher advantages. The KL divergence term prevents the model from straying too far from a reference policy.
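
A sketch of this objective as a PyTorch loss is shown below. It assumes per-answer log-probabilities under the current policy, the old (sampling) policy, and the frozen reference policy are already available, and the per-sample KL estimator used here is one common choice rather than the only option:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.001):
    # Probability ratio between the current and the old (sampling) policy
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # Clipped surrogate: min(ratio * advantage, clipped_ratio * advantage)
    surrogate = torch.min(ratio * advantages, clipped * advantages)

    # Per-sample KL estimate between the current policy and the reference policy
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0

    # Maximizing J_GRPO is equivalent to minimizing its negative
    return -(surrogate - beta * kl).mean()
```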


GRPO vs. PPO: Key Differences

A side-by-side comparison shows the fundamental differences between PPO and GRPO:

```mermaid
flowchart TD
    subgraph "PPO Architecture"
    A1[Policy Model] --> B1[Generate Answer]
    B1 --> C1[Reward Model]
    C1 --> D1[Value Network]
    D1 --> E1[Advantage Estimation]
    E1 --> F1[Policy Update]
    end
    
    subgraph "GRPO Architecture"
    A2[Policy Model] --> B2[Generate Multiple Answers]
    B2 --> C2[Reward Model]
    C2 --> D2[Group Statistics]
    D2 --> E2[Advantage Calculation]
    E2 --> F2[Policy Update]
    end
    
    G[Main Difference] --> H[GRPO eliminates Value Network]
    G --> I[GRPO uses group statistics as baseline]
```

The key innovation in GRPO is the elimination of a separate value network. Instead, GRPO relies on group statistics to compute advantages, resulting in a more memory-efficient and simpler implementation.
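
The contrast in how the baseline is obtained can be summarized in a few lines; both functions are deliberately simplified sketches, not complete implementations of either algorithm:

```python
def ppo_advantage(reward, value_net, state):
    # PPO: a separate, trained value network predicts the baseline
    return reward - value_net(state)

def grpo_advantage(reward, group_rewards, eps=1e-8):
    # GRPO: the other answers sampled for the same question provide the baseline
    mean_g = sum(group_rewards) / len(group_rewards)
    var_g = sum((r - mean_g) ** 2 for r in group_rewards) / len(group_rewards)
    return (reward - mean_g) / (var_g ** 0.5 + eps)
```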


Complete Training Pipeline

The overall training pipeline alternates between supervised fine-tuning (SFT) and GRPO phases:

```mermaid
flowchart TD
    A[Pre-trained LLM] --> B[Stage 1: SFT]
    B --> C[Stage 2: GRPO Training]
    C --> D[Stage 3: SFT with Synthetic Data]
    D --> E[Stage 4: GRPO Alignment]
    E --> F[Final Reasoning-Enhanced LLM]
    
    subgraph "SFT Data"
    G[High-quality Expert Demonstrations]
    end
    
    subgraph "GRPO Components"
    H[Policy Model]
    I[Reward Functions]
    J[Group Advantage Estimation]
    end
    
    subgraph "Synthetic Data Generation"
    K[Generate Examples]
    L[LLM-as-Judge Filtering]
    M[Quality Control]
    end
```

The pipeline includes initial SFT on expert data, followed by GRPO training to enhance reasoning, synthetic data SFT for expansion, and a final GRPO alignment phase for improved helpfulness and safety.
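
At a very high level, the pipeline can be read as the following sequence of stages; each function below is a placeholder for an entire training phase and is passed in rather than implemented here:

```python
# Hypothetical orchestration of the four-stage pipeline described above.
def train_reasoning_llm(pretrained_model, expert_data, reward_fns,
                        sft, grpo_train, generate_synthetic_data):
    model = sft(pretrained_model, expert_data)          # Stage 1: SFT on expert demos
    model = grpo_train(model, reward_fns["reasoning"])  # Stage 2: GRPO training
    synthetic = generate_synthetic_data(model)          # generation + LLM-as-judge filtering
    model = sft(model, synthetic)                       # Stage 3: SFT on synthetic data
    model = grpo_train(model, reward_fns["alignment"])  # Stage 4: GRPO alignment
    return model
```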


Hyperparameter Optimization

| Parameter | Description | Recommended Range | Effect |
|---|---|---|---|
| Group Size (G) | Number of answers per question | 4–8 | Higher → better baseline estimate |
| KL Weight (β) | Controls policy drift | 0.0001–0.001 | Higher → less policy drift |
| Clipping (ε) | Limits the size of policy updates | 0.1–0.3 | Higher → larger allowed updates |
| Learning Rate | Step size for updates | 1e-6 to 1e-5 | Lower → more stable training |

Careful tuning of these hyperparameters helps optimize GRPO for various model sizes and reasoning tasks.
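
For reference, these settings might be grouped into a small configuration object like the one below; the dataclass and field names are illustrative, not tied to any particular library:

```python
from dataclasses import dataclass

@dataclass
class GRPOConfig:
    group_size: int = 8          # G: answers sampled per question (4-8)
    kl_weight: float = 0.001     # beta: strength of the KL penalty
    clip_eps: float = 0.2        # epsilon: clipping range for the policy ratio
    learning_rate: float = 1e-6  # lower values give more stable training

config = GRPOConfig()
```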


Conclusion

GRPO represents a significant advancement in reinforcement learning for LLMs by:

  • Eliminating the Value Network: Simplifies the architecture and reduces memory overhead.
  • Using Group-Based Advantage Estimation: Leverages the collective performance of multiple answers as a dynamic baseline.
  • Improving Scalability: Makes training more efficient, even on consumer hardware.

This approach democratizes advanced RL training, enabling researchers and developers to build specialized reasoning models with fewer resources.
