Start With the Judge: A Practical Blueprint for Dataset Engineering
TL;DR: Most LLM “training” work is really dataset engineering: defining the task, crafting a tiny set of crystalline examples, and building a reliable judge that can score outputs. If you start by perfecting the judge and then use it to drive generation, selection, and reinforcement learning—plus a few pragmatic guardrails—you can turn a tinkery, manual grind into a repeatable pipeline (and eventually, an automated agentic system).
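To make "a reliable judge that can score outputs" concrete before we go further, here is a minimal sketch of the interface such a judge might expose. Everything in it is an illustrative assumption, not the article's implementation: the `Verdict` type, the textual checks, and the SQL window-function example (borrowed from the failure class below) stand in for whatever execution harness or rubric-driven LLM call a real judge would use.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    score: float    # 0.0-1.0, higher is better
    rationale: str  # why the judge scored the output this way

def judge_sql_window_query(prompt: str, output: str) -> Verdict:
    """Toy judge for the 'SQL with window functions' failure class.

    A production judge would run the query against fixtures or call an
    LLM with a rubric; this sketch runs cheap textual checks just to
    show the shape: (prompt, output) -> scored, explainable verdict.
    """
    lowered = output.lower()
    checks = {
        "non-empty output": bool(output.strip()),
        "has an OVER clause": "over(" in lowered.replace(" ", ""),
        "uses a window function": any(
            fn in lowered for fn in ("row_number", "rank(", "lag(", "lead(")
        ),
    }
    passed = sum(checks.values())
    return Verdict(
        score=passed / len(checks),
        rationale="; ".join(
            f"{name}: {'pass' if ok else 'fail'}" for name, ok in checks.items()
        ),
    )

if __name__ == "__main__":
    v = judge_sql_window_query(
        "Rank each customer's orders by date.",
        "SELECT *, ROW_NUMber() OVER (PARTITION BY customer_id "
        "ORDER BY order_date) AS rn FROM orders;",
    )
    print(f"{v.score:.2f} - {v.rationale}")
```

The key property is that the judge returns a score plus a rationale, so the same function can later gate generated examples, rank candidates for selection, and serve as a reward signal for reinforcement learning.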
The pattern today (and why it’s exhausting)
You notice a recurring failure: maybe the model keeps botching a class of SQL problems (“write a query that does X with window functions”), or it can’t follow a bespoke DSL, or it slips out of character in voice-constrained writing. The current playbook looks like this: