Start With the Judge: A Practical Blueprint for Dataset Engineering
TL;DR: Most LLM “training” work is really dataset engineering: defining the task, crafting a tiny set of crystalline examples, and building a reliable judge that can score outputs. If you start by perfecting the judge and then use it to drive generation, selection, and reinforcement learning—plus a few pragmatic guardrails—you can turn a tinkery, manual grind into a repeatable pipeline (and eventually, an automated agentic system).
The pattern today (and why it’s exhausting)
You notice a recurring failure: maybe the model keeps botching a class of SQL problems (“write a query that does X with window functions”), or it can’t follow a bespoke DSL, or it slips out of character in voice-constrained writing. The current playbook looks like this:
- Write a tight prompt describing rules, scope, and pitfalls.
- Craft 3–5 “seed exemplars.” These are hand-made, obsessively edited, diverse examples that capture the true shape of the task.
- Few-shot the model. You paste those seeds into the conversation history and see if the model now behaves.
- Iterate. You fix leaks in the prompt/spec, adjust examples, shuffle order, and repeat.
When that plateaus, you add a second system: a judge. You define what “good” looks like, create positive/negative examples, and prompt an LLM to evaluate model outputs. Eventually you connect the two: sample many candidate answers, score them with the judge, and pick or optimize the best. If you push further, you train with RL (e.g., PPO/GRPO) using the judge as the reward signal.
It works—but it’s painfully manual. The hardest part is often those first few examples and the never-ending triage that follows.
The leverage point most teams miss: start with the judge
If a scarce human hour can either:
- hand-label one more training example that will be used once, or
- sharpen the judge that will grade thousands of examples,
you should almost always invest in the judge. A calibrated judge lets you:
- Scale cheaply: Generate a large candidate set and keep only high scorers.
- Optimize: Use RL to move the policy toward what the judge prefers.
- Move fast: Try ablations, new prompts, or “mixes” and instantly see what improves.
What a good judge looks like
A useful judge isn’t just a sentiment score; it encodes a rubric with reasons:
- Must-haves: correctness criteria, constraints, formatting.
- Must-not-haves: hallucinations, policy-violating behaviors (e.g., leaking internal tools).
- Tie-breakers: clarity, efficiency, style consistency, creativity (where relevant).
- Evidence: require the judge to quote the specific lines that justify its score.
Here’s a compact rubric sketch for SQL:
Rubric (SQL correctness & style)
- Correctness: Does the result set match the described intent? (5 pts)
- Constraints: Uses required constructs (e.g., window fn), no prohibited joins. (3 pts)
- Safety: Avoids SELECT *. Handles nulls / duplicates if relevant. (2 pts)
- Clarity: Readable CTEs, sensible aliases, brief rationale. (2 pts)
- Return: total score /12 and 2–3 evidence bullets.
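To make this concrete, here is a minimal Python sketch of a rubric-driven judge. `call_llm` is a hypothetical stand-in for whatever chat-completion wrapper you already use, and the JSON contract is just one convenient way to force a score plus quoted evidence:

```python
import json
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your chat-completion wrapper; swap in a real client."""
    raise NotImplementedError

JUDGE_PROMPT = """You are grading a SQL answer against this rubric:
- Correctness: result set matches the described intent (5 pts)
- Constraints: uses required constructs (e.g., window fn), no prohibited joins (3 pts)
- Safety: avoids SELECT *, handles nulls/duplicates where relevant (2 pts)
- Clarity: readable CTEs, sensible aliases, brief rationale (2 pts)

Task:
{task}

Candidate answer:
{answer}

Return JSON only: {{"score": <0-12>, "evidence": ["<quoted line>", "<quoted line>"]}}
"""

@dataclass
class Verdict:
    score: float         # 0-12 per the rubric above
    evidence: list[str]  # verbatim quotes justifying the score

def judge(task: str, answer: str) -> Verdict:
    """Score one candidate against the rubric, keeping quoted evidence."""
    raw = call_llm(JUDGE_PROMPT.format(task=task, answer=answer))
    data = json.loads(raw)  # in practice: validate, and retry on parse failures
    return Verdict(score=float(data["score"]), evidence=list(data["evidence"]))
```

Forcing a structured verdict with quoted evidence also makes human spot checks and drift detection much cheaper later on.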
Spend real effort on negative examples that capture tempting shortcuts (reward hacking bait): e.g., trivial templates, ignoring edge cases, or superficial compliance that “looks right” but is wrong.
The minimal pipeline
- Seed → Behave: Create 3–5 impeccable exemplars and a crisp prompt. Few-shot until you get baseline competence.
- Judge → Calibrate: Build a judge with a rubric and its own exemplars. Do brief human spot checks—not to grade everything, but to confirm the judge’s preferences match yours and to fix drift.
- Generate → Select: For each input, sample many candidates (temperature/decoding diversity). Score with the judge. Keep top-k (see the sketch after this list).
- Optimize → RL: Use the judged scores as a reward signal. Train with PPO/GRPO to “bake in” judged preferences. Keep a small human-in-the-loop audit: check pairwise preferences the judge asserts, and adjust rubric/examples if you see mistakes.
- Guardrail → Audit: Add adversarial tests for reward hacking: templated poetry, spec-lawyering, style without substance, or brittle shortcuts. Periodically inject these into evaluation to ensure the judge penalizes them.
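Here is a minimal sketch of the Generate → Select and Optimize → RL plumbing, reusing the `judge` sketch above. `generate` is a hypothetical sampling wrapper, and the thresholds are illustrative:

```python
def generate(prompt: str, n: int, temperature: float) -> list[str]:
    """Hypothetical stand-in for your sampling wrapper; returns n completions."""
    raise NotImplementedError

def select_top_k(task: str, prompt: str, n: int = 16, k: int = 2,
                 min_score: float = 9.0) -> list[tuple[str, float]]:
    """Sample diverse candidates, score each with the judge, keep the best few."""
    candidates = generate(prompt, n=n, temperature=1.0)
    scored = sorted(((c, judge(task, c).score) for c in candidates),
                    key=lambda pair: pair[1], reverse=True)
    # Apply an absolute bar too, so a weak batch cannot sneak into the dataset.
    return [(c, s) for c, s in scored[:k] if s >= min_score]

def reward_fn(task: str, answer: str) -> float:
    """Judge score rescaled to [0, 1], suitable as an RL reward signal."""
    return judge(task, answer).score / 12.0
```

RL trainers typically consume a scalar reward per sample, so a thin wrapper like `reward_fn` is usually all the glue the judge needs.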
Use few-shot “mixes” as a lab before you fine-tune
Before spending on fine-tuning runs, simulate them with large few-shot blocks (e.g., 20–30 examples in context). Treat this as a proxy policy:
- Ablate: remove or swap subsets of examples to see which cases carry the lift.
- Permute: vary order and formatting; keep what’s robust.
- Stress test: run your judge over dozens of mixes to find the best-performing recipe.
This gives you learning signals at near-zero cost and often reveals a simpler data recipe than you expected.
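One way to run that lab, reusing `generate` and `judge` from the sketches above. `pool` is your bank of candidate exemplars and `eval_set` a small held-out set of inputs; both names (and the `{"q": ..., "a": ...}` record shape) are illustrative:

```python
import random

def build_prompt(mix: list[dict], query: str) -> str:
    """Render a few-shot block plus the new query (format is illustrative)."""
    shots = "\n\n".join(f"Q: {ex['q']}\nA: {ex['a']}" for ex in mix)
    return f"{shots}\n\nQ: {query}\nA:"

def score_mix(mix: list[dict], eval_set: list[dict]) -> float:
    """Average judge score over a held-out eval set for one few-shot mix."""
    total = 0.0
    for item in eval_set:
        answer = generate(build_prompt(mix, item["q"]), n=1, temperature=0.2)[0]
        total += judge(item["q"], answer).score
    return total / len(eval_set)

def best_mix(pool: list[dict], eval_set: list[dict],
             mix_size: int = 25, trials: int = 20, seed: int = 0) -> list[dict]:
    """Ablate and permute: random subsets in random order, keep the top scorer."""
    rng = random.Random(seed)
    winner, best = [], float("-inf")
    for _ in range(trials):
        mix = rng.sample(pool, k=min(mix_size, len(pool)))  # subset + shuffled order
        s = score_mix(mix, eval_set)
        if s > best:
            winner, best = mix, s
    return winner
```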
Keep the mix healthy (and avoid forgetting)
As your dataset grows, the per-task weighting shifts. Two common pitfalls:
- Over-concentrating on one subtask because it’s easy to improve and judge well, starving others that need more coverage.
- Catastrophic forgetting of general knowledge or earlier wins.
Countermeasures:
- Maintain a task registry with target proportions and difficulty tags (sketched below).
- Track per-subtask win rate (vs. a frozen baseline), judge agreement with humans, and coverage (how many unique patterns each task has).
- Run regression suites (held-out examples, public benchmarks) after every mix/RL update. Failing a gate reverts the change or reduces that task’s weight.
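A registry and a regression gate can start as small as this sketch; the task names and thresholds are placeholders, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class TaskEntry:
    target_share: float   # desired fraction of the global mix
    difficulty: str       # e.g., "easy" or "hard"
    win_rate: float = 0.0 # vs. a frozen baseline, refreshed after each eval
    coverage: int = 0     # count of unique patterns represented

REGISTRY: dict[str, TaskEntry] = {
    "windowed_dedupe_sql": TaskEntry(target_share=0.15, difficulty="hard"),
    "bespoke_dsl":         TaskEntry(target_share=0.10, difficulty="hard"),
    "voice_constrained":   TaskEntry(target_share=0.05, difficulty="easy"),
}

@dataclass
class EvalReport:
    per_task_win_rate: dict[str, float]
    judge_human_agreement: float
    safety_pass_rate: float

def passes_gates(report: EvalReport, min_win: float = 0.55,
                 min_agreement: float = 0.85, min_safety: float = 0.99) -> bool:
    """Reject a mix/RL update if any regression gate fails (thresholds are placeholders)."""
    return (all(w >= min_win for w in report.per_task_win_rate.values())
            and report.judge_human_agreement >= min_agreement
            and report.safety_pass_rate >= min_safety)
```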
Blueprint for an automated “dataset engineer” agent
This is where the toil turns into tools. Your agent’s responsibilities:
- Detect a weak subtask. From eval telemetry or error reports, cluster failures and nominate a subtask (e.g., “windowed dedupe queries”).
- Author a judge-first rubric. Draft rubric + seed positive/negative examples. Request minimal human calibration (approve/revise rubric; skim 10 judged pairs).
- Generate candidates & curate seeds. Sample diverse outputs, score with the judge, and surface a “shortlist” for human review (only borderline cases).
- Run few-shot lab experiments. Build multiple 20–30 example mixes, ablate, and rank them with the judge. Keep the top mix and archive the rest.
- Update the global mix. Adjust per-task weights to hit target proportions. If the global dataset grows, scale examples in heavier tasks to preserve influence.
- Gate with regressions. Evaluate on held-out suites, safety/guardrail sets, and judge-vs-human agreement checks. If gates pass, proceed.
- Optionally apply RL. Use judge scores as the reward signal, train, then re-run gates. Detect reward hacking by injecting adversarial tests.
- Log & learn. Keep lineage: which rubric, which seeds, which mix produced the win. When a regression appears later, you’ll know what to revisit. (The whole loop is sketched below.)
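Stitched together, the loop is roughly this shape. Every callable here is a stand-in for the sketches above or for infrastructure you would supply; the point is the ordering and where humans stay in the loop:

```python
from typing import Callable

def dataset_engineer_step(
    detect_weak_subtask: Callable[[], str],
    draft_rubric: Callable[[str], str],
    human_approves: Callable[[str], bool],
    curate_seeds: Callable[[str], list[str]],
    run_mix_lab: Callable[[list[str]], list[str]],
    gates_pass: Callable[[list[str]], bool],
    maybe_run_rl: Callable[[list[str]], None],
    log_lineage: Callable[[str, str, list[str], list[str]], None],
) -> None:
    """One pass of the agent loop; all callables are illustrative stand-ins."""
    subtask = detect_weak_subtask()           # 1. cluster failures, pick a weak spot
    rubric = draft_rubric(subtask)            # 2. judge-first rubric draft
    if not human_approves(rubric):            #    minimal human calibration
        return
    seeds = curate_seeds(subtask)             # 3. generate, judge, shortlist seeds
    mix = run_mix_lab(seeds)                  # 4. ablate few-shot mixes, keep the best
    if not gates_pass(mix):                   # 5-6. rebalance the global mix, then gate
        return                                #      (revert or down-weight on failure)
    maybe_run_rl(mix)                         # 7. optional RL with judge-based reward
    log_lineage(subtask, rubric, seeds, mix)  # 8. keep provenance for later debugging
```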
Key metrics dashboard
- Win rate vs. baseline (overall and per subtask)
- Judge–human agreement (%) on sampled pairs
- Coverage/diversity (n-gram novelty, schema/edge-case variety)
- Safety/guardrail pass rate
- Oversight budget: human minutes per 1000 examples
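These roll up naturally into a small snapshot object logged per mix or RL update; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class DashboardSnapshot:
    win_rate_overall: float               # vs. a frozen baseline
    win_rate_per_subtask: dict[str, float]
    judge_human_agreement: float          # share of sampled pairs where judge and human agree
    coverage_novelty: float               # e.g., n-gram novelty / schema variety score
    safety_pass_rate: float               # guardrail and adversarial-probe suites
    human_minutes_per_1k_examples: float  # oversight budget
```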
Practical tips that save weeks
- Write the negative examples first. It flushes out underspecifications in your rubric.
- Force evidence in the judge’s rationale. “Why is this wrong?” with quotes prevents hand-wavy scoring.
- Randomize, then fix. Shuffle few-shot orders during exploration; once you pick a mix, fix the order so improvements are attributable.
- Treat prompts as versioned code. Check them in, diff them, and tie them to eval runs.
- Prefer small, surgical edits to exemplars over grand rewrites; you’ll learn faster.
The endgame
LLM training today is an orchestration problem—less “train a giant model,” more “continuously curate data, preferences, and checks.” The fastest teams start with the judge, use it to power high-throughput selection and RL, and let an agent manage the grunt work: surfacing only the few decisions that genuinely need human taste.
Do that, and the work stops feeling like an endless, fragile tinkering session. It becomes a system—tight loops, measurable progress, and far more leverage on every minute you spend.