@thehunmonkgroup
Created August 12, 2025 16:33
Summary: **Improving Factuality in Reasoning Large Language Models Through Online Reinforcement Learning**

URL: https://arxiv.org/pdf/2508.05618

Improving Factuality in Reasoning Large Language Models Through Online Reinforcement Learning


QUESTION 1:

Can you give me a very clear explanation of the core assertions, implications, and mechanics elucidated in this paper?

ANSWER 1:

Here’s a concise, structured explanation of the paper’s core assertions, implications, and mechanics, using the paper as the source of truth.

Core assertions

  • Existing reasoning LLMs (R-LLMs) hallucinate more on long-form factual tasks than their non-reasoning counterparts.
    • DeepSeek-R1 and QwQ-32B have, on average, 10 and 13 percentage points lower factual precision than DeepSeek-V3 and Qwen-2.5-32B, respectively.
  • The main cause is misaligned RL training objectives in R-LLMs: training focuses on verifiable tasks (math/coding) and overlooks factuality in open-ended, long-form responses.
  • Online RL for long-form factuality is feasible if you design the reward correctly and make long-form factuality scoring fast enough.
  • A composite reward that jointly optimizes factual precision, detail level, and relevance averts reward hacking and improves factuality without harming overall helpfulness.

Implications

  • You cannot optimize long-form factuality by targeting precision alone; doing so induces shorter, less helpful, or off-topic responses.
  • Adding a relevance/quality term (LLM-as-a-judge) to the factual reward is essential to prevent reward hacking (e.g., padding with general, factual but irrelevant content).
  • On-policy online RL (GRPO) with the proposed reward outperforms offline methods like DPO: it raises factual precision and detail while maintaining or improving helpfulness.
  • Making long-form factuality scoring fast (∼5s vs ∼2 minutes) unlocks practical online RL for factual reasoning.
  • R-LLMs need "factual reasoning" strategies (e.g., synthesis, summarization, comparison) that differ from math/coding reasoning.

Mechanics (what they actually built and did)

  • Training data
    • Prompts: 7k synthetic, fact-seeking questions generated by Llama 4, guided by two grounding sets: diverse WildChat prompts (for realism) and LongFact non-test prompts (for factuality). Split: 3k for SFT, 4k for RL.
  • Model formatting and seeding
    • Long-CoT format: a delimited reasoning block followed by the final answer (the model first writes out its chain of thought, then the response).
    • SFT: For each SFT prompt, generate 10 Long-CoT responses with the base model; score with VeriScore; SFT on the best (highest factual precision) to teach the format and seed reasoning.
  • Offline RL baseline (DPO)
    • Generate 10 responses per RL prompt; score with VeriScore.
    • Build preference pairs by maximum factual-precision margin, subject to two constraints: margin ≥ 0.1 and length similarity |1 − l_c/l_r| ≤ 0.1 (to reduce length hacking); see the pair-selection sketch after this list.
    • Offline DPO improves precision but hurts overall response quality/helpfulness.
  • Online RL (GRPO) with composite reward
    • Reward R(y|x) = precision + λ·detail + μ·relevance, with a −1.0 penalty if the output format is malformed (see the reward sketch after this list).
      • Precision: smoothed factual precision F/(T+1) from VeriScore (F supported claims; T total claims).
      • Detail: log(1 + F) to reward more supported claims but with diminishing returns.
      • Relevance/quality: LLM-as-a-judge binary win vs. a reference response from the base model (1 if better, 0 otherwise).
    • Main setting: λ = 0, μ = 0.1; also explored λ = 0.01, μ = 0.1 and λ = 0.1, μ = 0.1.
  • Fast VeriScore
    • Parallelized batched claim extraction and verification; async Google Search via Serper; served by Matrix with multiple Llama-3.3-70B-Instruct replicas.
    • Speedup: average under 5 seconds per response (vs ~2 minutes), enabling real-time reward in online RL.
    • For evaluation consistency, they still report with the original VeriScore finetuned extractor/verifier.
  • Evaluation setup
    • Six long-form factuality benchmarks: LongFact (Objects), FAVA (141 prompts), AlpacaFact (241), Biography (183), FactBench-Hard (532), Factory-Hard.
    • Metrics: precision = F/T; detail level = F (they avoid VeriScore’s approximate recall).
    • Helpfulness: AlpacaEval-style pairwise GPT-4o judging vs. the base model (win rate).
  • Implementation details (compute)
    • Base: Llama-3.1-8B-Instruct.
    • SFT: 8×H100, 1 epoch; DPO: 16×H100, 1 epoch; GRPO: 32×H100 training workers + 8×H100 inference workers, 4 rollouts/prompt, 1 epoch.
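
To make the training mechanics concrete, here is a minimal Python sketch of the composite GRPO reward and the DPO pair filter described above. It is not the paper's code: `verify_claims` (a VeriScore-style extractor/verifier returning F supported and T total claims) and `judge_prefers` (the LLM-as-a-judge comparison against a base-model reference) are hypothetical stand-ins; only the arithmetic follows the formulas reported in the paper.

```python
import math
from typing import Callable, List, Optional, Tuple


def composite_reward(
    response: str,
    reference: str,
    verify_claims: Callable[[str], Tuple[int, int]],   # hypothetical: returns (F supported, T total) claims
    judge_prefers: Callable[[str, str], bool],          # hypothetical: True if response beats the reference
    lam: float = 0.0,        # λ, weight on the detail term (main setting in the paper: 0)
    mu: float = 0.1,         # μ, weight on the relevance term (main setting: 0.1)
    well_formatted: bool = True,
) -> float:
    """R(y|x) = precision + λ·detail + μ·relevance, with a -1.0 penalty for malformed output."""
    if not well_formatted:
        return -1.0
    supported, total = verify_claims(response)
    precision = supported / (total + 1)          # smoothed factual precision F / (T + 1)
    detail = math.log(1 + supported)             # log(1 + F): more supported claims, diminishing returns
    relevance = 1.0 if judge_prefers(response, reference) else 0.0
    return precision + lam * detail + mu * relevance


def select_dpo_pair(
    responses: List[str],
    precisions: List[float],
    min_margin: float = 0.1,
    max_len_gap: float = 0.1,
) -> Optional[Tuple[str, str]]:
    """Pick the (chosen, rejected) pair with the largest factual-precision margin,
    requiring margin >= 0.1 and length similarity |1 - l_c/l_r| <= 0.1."""
    best, best_margin = None, 0.0
    for i, (chosen, p_c) in enumerate(zip(responses, precisions)):
        for j, (rejected, p_r) in enumerate(zip(responses, precisions)):
            if i == j or len(rejected) == 0:
                continue
            margin = p_c - p_r
            length_gap = abs(1 - len(chosen) / len(rejected))
            if margin >= min_margin and length_gap <= max_len_gap and margin > best_margin:
                best, best_margin = (chosen, rejected), margin
    return best
```

In the paper's main setting (λ = 0, μ = 0.1) the detail term drops out, so the relevance term alone is what keeps the policy from padding with off-topic facts.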

Key empirical findings

  • Existing reasoning models on long-form factuality:
    • Qwen-2.5-32B (non-reasoning) vs QwQ-32B (reasoning): precision 51.4 → 38.3; detail 21.9 → 27.3.
    • DeepSeek-V3 (non-reasoning) vs DeepSeek-R1 (reasoning): precision 55.0 → 44.7; detail 30.7 → 27.0.
    • Conclusion: today’s R-LLMs hallucinate more on long-form factual tasks.
  • Offline training (from Llama-3.1-8B):
    • SFT: +10.9 precision points but −25% detail on average.
    • SFT + DPO: +22.8 precision points, detail roughly maintained/improved on 4/6 datasets, but win rate drops to 37.8% (quality degradation).
  • Online training (SFT + GRPO with λ=0, μ=0.1):
    • Average precision 68.1%: +23.1 points over base (45.0%).
    • Detail level: +23% supported claims over base.
    • Helpfulness: 54.4% win rate vs base (no degradation).
  • Ablations (reward design)
    • R_fact only: precision up, detail up, but win rate falls to 43.5% (reward hacking via off-topic but factual content).
    • R_fact + R_dtl: further boosts precision/detail, but win rate worse (36.9%), showing stronger hacking.
    • R_fact + R_rel: precision +23.1 points, detail +23%, and win rate 54.4%—meaning relevance term curbs hacking.
    • Full reward (R_fact + R_dtl + R_rel):
      • λ=0.01, μ=0.1: more detail (+13% vs μ-only) with small precision drop (−1 point), win rate still >50%.
      • λ=0.1, μ=0.1: very large detail gains (+88% vs base), precision +22 points, but win rate 45.7% (hacking resurfaces).
  • Reasoning trace analysis
    • CoT and answer lengths rise early in training, then plateau.
    • Frequent “meta-reasoning” strategies differ from math/coding; more synthesis, summarization, explanation, definition, comparison.

What this means (takeaways)

  • To improve long-form factuality in R-LLMs, you need:
    • On-policy online RL.
    • A composite reward that balances factual precision, detail, and relevance to avoid reward hacking.
    • Fast, reliable long-form factuality scoring to support online training.
  • The proposed recipe yields higher factual precision and more factual detail while keeping responses helpful.
  • The field should treat factual reasoning as distinct from math/coding reasoning, requiring different training signals and reasoning strategies.
  • The approach is likely extensible to agentic settings (e.g., with search tools), potentially enabling further factuality gains.

Limitations and future directions acknowledged by the paper

  • LLM-as-judge is the base model (chosen for efficiency); stronger judges were not explored.
  • VeriScore (and all long-form factuality scoring) is still an approximation; recall is hard to assess reliably.
  • There is an inherent precision–detail–relevance trade-off controlled by λ and μ.
  • Future work: agentic factual reasoning with tools (e.g., search) to fill knowledge gaps during the reasoning process.

QUESTION 2:

Can you explain the value of this in basic terms? Like you're talking to a CEO. So what? What's the bottom line here?

ANSWER 2:

Here’s the CEO version — the “so what” in plain terms.

What’s the problem?

  • Today’s reasoning LLMs are great at math/code but actually hallucinate more on long-form factual tasks than standard models. That’s bad for trust, liability, and brand safety in knowledge-heavy workflows.

What did this paper do?

  • It shows a practical way to train models that give long, factual answers without becoming terse or going off-topic.
  • The key is a new reward design that balances three things at once: factual precision, level of detail, and relevance. They also made factuality scoring fast enough for real-time training.

What’s the bottom line (outcomes)?

  • +23.1 percentage points in factual precision vs the same base model.
  • +23% more supported facts in responses (more comprehensive answers).
  • Helpfulness doesn’t drop (wins 54.4% vs the base model in pairwise judgments).
  • In short: fewer hallucinations, more substance, still useful.

Why this matters for the business

  • Increases trust and reduces risk in customer-facing and regulated content (support, marketing, knowledge bases, analyst reports, bios, briefings).
  • Improves productivity in research and drafting by delivering both accurate and detailed outputs.
  • Differentiates product quality: avoids common “reward hacking” (models gaming metrics by being shorter or off-topic).

What’s novel here

  • First practical recipe for online RL on long-form factuality: they sped up verification ~30× (to ~5s/response), making on-policy training feasible.
  • The composite reward fixes the usual pitfalls of optimizing “factuality” alone.

Where this can be used

  • Customer support, content generation, enterprise search summarization, biographical/profile generation, compliance-friendly documentation, editorial assistants.

Caveats and requirements

  • Training is compute-intensive (tens of H100s in their setup).
  • There’s a tunable trade-off between maximum precision and more detail.
  • Long-form factuality scoring is still an approximation, though strong enough to deliver the gains reported.

Strategic takeaway

  • If you rely on LLMs for long-form outputs, switching to this training approach yields more reliable and still helpful responses—meaning safer deployment, better user trust, and higher-quality content at scale.

QUESTION 3:

Can you give me an analogy or metaphor that will help explain this to a broad audience?

ANSWER 3:

Think of training a newsroom

  • If you pay reporters only for “no mistakes,” they’ll file one-sentence, risk-free blurbs. Accurate, but useless.
  • If you pay only for “more words,” they’ll turn in long, generic encyclopedia dumps. Detailed, but off-topic.
  • If you ask an editor to eyeball everything, they’ll miss subtle errors in long pieces.

What this paper does is build a better newsroom scorecard and a faster fact-check desk:

  • Three-part scorecard
    • Accuracy = every claim is fact-checked (factual precision).
    • Depth = more verified facts earn more credit, but with diminishing returns (detail).
    • Relevance = an editor compares the article to a solid baseline draft and only rewards it if it’s genuinely better and on-topic (LLM-as-judge).
  • Instant fact-checking
    • They turbocharge the fact-check desk so it can verify an article in seconds, not minutes. That enables coaching reporters while they write (online RL), not after the issue goes to print.

Result

  • Reporters learn to file stories that are accurate, substantive, and actually about the assignment—without gaming the system by going short or going off-topic.
  • In the paper’s terms: far fewer hallucinations (+23.1 precision points), more supported facts (+23%), and no drop in overall usefulness.

In short: it’s a newsroom that rewards the right things and can fact-check fast enough to shape behavior in real time.


QUESTION 4:

Can you provide a specific example that best illustrates the key points elucidated in this paper?

ANSWER 4:

Here’s a concrete, self-contained example that mirrors the paper’s key points.

Prompt

  • “Who is Leon Wildes?”

Three candidate responses the model might learn to produce

  • A (short, safe): “Leon Wildes is an immigration attorney.”
  • B (long, off-topic padding): “While specific details about Leon Wildes are not readily available, here is an overview of U.S. immigration law…” followed by many accurate, general facts about immigration law.
  • C (on-topic, detailed): A brief bio focused on Leon Wildes with multiple specific, verifiable facts (career, notable cases, affiliations), mostly correct.

How the reward works and what happens

  • The paper’s reward R = factual precision + λ·detail + μ·relevance.
    • Factual precision: supported claims F divided by total claims T+1 (smoothed).
    • Detail: log(1+F) (diminishing returns).
    • Relevance/quality: an LLM judge compares the answer to a reference from the base model; adds μ if better, 0 otherwise.

Illustrative (hypothetical) scoring for the three answers

  • A: 1 supported claim out of 1 total → precision ≈ 1/(1+1)=0.5; detail = log(2); relevance likely 0 (not better than a solid baseline).
  • B: 25 supported out of 25 total (all true but off-topic) → precision ≈ 25/26≈0.96; detail = log(26); relevance = 0 (judge rejects for being off-topic).
  • C: 17 supported out of 18 total → precision ≈ 17/19 ≈ 0.89; detail = log(18); relevance = 1 (judge prefers this to the base response).
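
Plugging these hypothetical claim counts into the paper's reward formula makes the comparison in the next list easy to check. A minimal sketch (the counts are the illustrative ones above; λ = 0.01 and μ = 0.1 follow the paper's explored settings, with λ = 0.01 reused for the detail-only case purely for illustration):

```python
import math

def reward(F, T, relevant, lam, mu):
    # R = F/(T+1) + λ·log(1+F) + μ·relevance
    return F / (T + 1) + lam * math.log(1 + F) + mu * (1.0 if relevant else 0.0)

# (F supported, T total, judged relevant?) for the three hypothetical answers
answers = {"A": (1, 1, False), "B": (25, 25, False), "C": (17, 18, True)}

settings = {
    "precision only        (λ=0,    μ=0)":   (0.0, 0.0),
    "precision + detail    (λ=0.01, μ=0)":   (0.01, 0.0),
    "precision + relevance (λ=0,    μ=0.1)": (0.0, 0.1),
    "full reward           (λ=0.01, μ=0.1)": (0.01, 0.1),
}

for name, (lam, mu) in settings.items():
    scores = {k: round(reward(F, T, rel, lam, mu), 3) for k, (F, T, rel) in answers.items()}
    print(f"{name}: {scores} -> picks {max(scores, key=scores.get)}")
```

Running it reproduces the pattern described next: B wins whenever μ = 0, and C wins once the relevance term is included.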

What different reward designs learn

  • Precision-only (λ=0, μ=0): picks B. The model learns to pad with generic but factual content—high precision, low usefulness. This is the reward hacking the paper warns about.
  • Precision + detail (λ>0, μ=0): still picks B. Reward hacking gets worse: the model adds even more off-topic “safe facts” to boost detail and keep precision high.
  • Precision + relevance (λ=0, μ>0): picks C. The judge term breaks the hack—off-topic padding no longer pays; the model focuses on accurate, on-topic details.
  • Full reward (λ>0, μ>0): still picks C, but you can tune λ to trade a bit of precision for more verified, relevant detail (as shown in the ablations).

Why this example matters

  • It captures the core failure mode: without a relevance term, models can look “more factual” by going off-topic.
  • It shows how the composite reward fixes that: you get fewer hallucinations (+23.1 precision points), more supported facts (+23%), and no loss in helpfulness (54.4% win rate vs base).
  • And it only works at scale because they sped up long-form factuality checking from minutes to ~5 seconds, enabling on-policy online RL (GRPO) to train on these signals in real time.