Generating Synthetic Data for LLM Evaluation

Summary

Use your application extensively to build intuition about failure modes
Define 3-4 dimensions based on observed or anticipated failures
Create structured tuples covering your priority failure scenarios
Generate natural language queries from each tuple using a separate LLM call
Scale to more examples across your most important failure hypotheses (we suggest starting with around 100)
Test and iterate on the most critical failure modes first, and generate more until you reach theoretical saturation

Detail

Start with Real Usage, Not Synthetic Data

Before generating any synthetic data, use your application yourself. Try different scenarios, edge cases, and realistic workflows. If you can't use it extensively, recruit 2-3 people to test it while you observe their interactions.

Generate Data to Test Specific Hypotheses

Create synthetic data only when you have a clear hypothesis about your application's failure modes. Synthetic data is most valuable for failure modes that:

Require systematic testing across many variations to understand the pattern
Occur infrequently but have high impact when they do occur
Involve complex interactions between multiple system components

Structure Generation with Dimensions To Maximize Diversity

When real user data is sparse, use structured generation rather than asking an LLM for "random queries."

Define 3-4 key dimensions that represent where your application is likely to fail. Here are example dimensions for a recipe bot:

Recipe Type: Main dish, dessert, snack, side dish
User Persona: Beginner cook, busy parent, fitness enthusiast, professional chef
Constraint Complexity: Single constraint, multiple constraints, conflicting constraints

After brainstorming dimensions, create tuples, each a specific combination of one value per dimension. For example:

(Main dish, Beginner cook, Single constraint)
(Dessert, Busy parent, Multiple constraints)
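One way to enumerate such tuples is to take the Cartesian product of the dimension values. The sketch below is illustrative only: the dictionary keys, function names, and the idea of sampling a random subset are assumptions, not part of the original method. The dimension values are the recipe bot examples listed above.

```python
import itertools
import random

# Dimension values taken from the recipe bot example; the key names are assumed.
DIMENSIONS = {
    "recipe_type": ["Main dish", "Dessert", "Snack", "Side dish"],
    "user_persona": ["Beginner cook", "Busy parent", "Fitness enthusiast", "Professional chef"],
    "constraint_complexity": ["Single constraint", "Multiple constraints", "Conflicting constraints"],
}


def all_tuples(dimensions):
    """Return every combination of one value per dimension."""
    return list(itertools.product(*dimensions.values()))


def sample_tuples(dimensions, k, seed=0):
    """Return up to k combinations chosen at random (seeded for repeatability)."""
    combos = all_tuples(dimensions)
    rng = random.Random(seed)
    return rng.sample(combos, min(k, len(combos)))


tuples = all_tuples(DIMENSIONS)
print(len(tuples))  # 4 * 4 * 3 = 48
print(tuples[0])    # ('Main dish', 'Beginner cook', 'Single constraint')
```

In practice you would likely hand-pick or filter the combinations that match your priority failure hypotheses rather than use all of them.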

Finally, generate queries from these tuples using a second LLM call.
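A minimal sketch of this step, assuming each tuple has the form (recipe type, user persona, constraint complexity). The prompt wording is invented for illustration, and `call_llm` is a hypothetical placeholder for whatever model client you use; it is passed in as a plain function so the sketch does not depend on any particular API.

```python
from typing import Callable, Dict, List, Tuple

# Prompt wording is an assumption; adapt it to your application.
PROMPT_TEMPLATE = (
    "Write one realistic message a user might send to a recipe assistant.\n"
    "Recipe type: {recipe_type}\n"
    "User persona: {user_persona}\n"
    "Constraint complexity: {constraint_complexity}\n"
    "Return only the user's message."
)


def build_prompt(tup: Tuple[str, str, str]) -> str:
    """Fill the template with the values from one tuple."""
    recipe_type, user_persona, constraint_complexity = tup
    return PROMPT_TEMPLATE.format(
        recipe_type=recipe_type,
        user_persona=user_persona,
        constraint_complexity=constraint_complexity,
    )


def generate_queries(
    tuples: List[Tuple[str, str, str]],
    call_llm: Callable[[str], str],
) -> List[Dict[str, object]]:
    """Make one LLM call per tuple and keep the tuple alongside its query."""
    return [{"tuple": t, "query": call_llm(build_prompt(t)).strip()} for t in tuples]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return " placeholder query "


examples = generate_queries([("Dessert", "Busy parent", "Multiple constraints")], fake_llm)
print(examples[0]["query"])
```

Keeping the source tuple next to each generated query makes it easier to trace a failure back to the dimension values that produced it.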

Scale Based on Iteration Needs

Start with around 100 synthetic examples. Keep generating more examples until you reach "theoretical saturation," the point where additional examples reveal few new failure modes.
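One rough way to check for saturation is to count how many previously unseen failure modes each new batch of reviewed examples reveals. The sketch below assumes you have labeled each example with a set of failure mode names; the batch size, the window of two batches, and the threshold of zero new modes are arbitrary assumptions, since "few" is left to your judgment.

```python
def new_modes_per_batch(labels, batch_size):
    """Count failure modes first seen in each consecutive batch of examples.

    labels: one set of failure mode names per reviewed example (empty if none).
    """
    seen = set()
    counts = []
    for start in range(0, len(labels), batch_size):
        batch = labels[start:start + batch_size]
        batch_modes = set().union(*batch)
        new = batch_modes - seen
        counts.append(len(new))
        seen |= new
    return counts


def looks_saturated(counts, window=2, max_new=0):
    """True if each of the last `window` batches revealed at most `max_new` new modes."""
    if len(counts) < window:
        return False
    return all(c <= max_new for c in counts[-window:])


# Hypothetical labels from manually reviewing eight generated examples.
labels = [
    {"ignores_allergy"}, set(),
    {"wrong_units"}, {"ignores_allergy"},
    set(), {"wrong_units"},
    set(), set(),
]
counts = new_modes_per_batch(labels, batch_size=2)
print(counts)                   # [1, 1, 0, 0]
print(looks_saturated(counts))  # True
```

A check like this is only a heuristic: it supports, rather than replaces, your judgment about whether new examples are still teaching you something.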
