Scoring Variance Experiment — Model Comparison Report

Project: Learco Personal, Project 1409 (Graphisoft CEO search) Date: 2026-04-04 Candidates: 359 with full LinkedIn profiles Scoring factors: 6, all on 0-5 scale (total max: 30) Runs per experiment: 5 Concurrency: 32 parallel requests (10 for Anthropic due to latency) Prompt: Identical across all experiments (exported via experiment:export-prompts using production CandidateScoringPromptService) Dealroom enrichment: Skipped (to eliminate external API nondeterminism)

1. Experiment Setup

What we measured

For each candidate × each factor × each run, the model returns a score (0-5) and reasoning. We define a candidate as unstable on a factor if max(scores) - min(scores) >= 2 across 5 runs. On a 0-5 scale, spread of 2 = 40% of the scale — this is not ±1 noise, it's a meaningful disagreement.

Scoring factors tested

ID	Factor	Max	Type
5968	International Revenue Growth Track Record	5	CRITICAL
5969	Industry & Market Relevance	5	CRITICAL
5970	Seniority & Scope	5	SUPPORTING
5971	CEO Readiness & Subsidiary Leadership	5	SUPPORTING
5972	SaaS & Subscription Leadership	5	SUPPORTING
5973	Architecture Domain Affinity	5	SUPPORTING

Prompt structure

All experiments used the same prompt per candidate, built by PHP CandidateScoringPromptService::generateScoringPrompt(). Prompt includes:

System instructions (role, JSON format, scoring rules)
Search specification document (plain text, HTML stripped)
Custom researcher note prompt (advanced format with company info)
All 6 factor definitions with rubrics, edge case rules, and ZERO GUARD instruction
Target company data (all data points from company_data table)
Full candidate LinkedIn profile JSON (~50-150K chars)
Accuracy check instruction referencing all data point names
Today's date

Average prompt length: ~160,000 characters per candidate.

What varied between experiments

Variable	Baseline	Exp A	Exp B	Exp C	Exp D
Model	grok-4-1-fast-non-reasoning	grok-4-1-fast-reasoning	grok-4-1-fast-reasoning	grok-4.20-0309-non-reasoning	claude-sonnet-4-6
Provider	xAI	xAI	xAI	xAI	Anthropic
Prompt change	—	—	+ deep analysis instruction	—	—
Output format	json_schema	json_schema	json_schema	json_schema	direct JSON (no tool_use)

2. Experiments Conducted

Baseline: grok-4-1-fast-non-reasoning

Model: grok-4-1-fast-non-reasoning (current production scoring model)
Provider: xAI
API: v1/chat/completions with response_format: json_schema
Cost: ~$0.011 per candidate ($0.20/M input, $0.50/M output)
Speed: ~4-5 req/s at 32 concurrency
Total time: ~8 min for 5 runs × 359 candidates
Results saved: experiments/results/experiment-baseline-41fast-nonreasoning/

Experiment A: grok-4-1-fast-reasoning

Model: grok-4-1-fast-reasoning (xAI reasoning model)
Provider: xAI
API: v1/chat/completions with response_format: json_schema
Cost: ~$0.011 per candidate ($0.20/M input, $0.50/M output)
Speed: ~0.7-1.1 req/s at 32 concurrency
Total time: ~30 min for 5 runs × 359 candidates
Results saved: experiments/results/experiment-a-reasoning-only/

Experiment B: grok-4-1-fast-reasoning + deep analysis prompt

Model: grok-4-1-fast-reasoning (same as Exp A)
Provider: xAI

Prompt modification: Added "Deep Profile Analysis Required" instruction after # Instructions:

## CRITICAL: Deep Profile Analysis Required
Before scoring ANY factor, you MUST perform a complete, thorough analysis
of the ENTIRE candidate profile below. This means:
1. Read EVERY work experience entry — do not skip any, even if the profile is long
2. Read ALL education entries, certifications, skills, and summary sections
3. For each factor, trace evidence across the FULL career history
4. Cross-reference job titles with company descriptions to understand actual scope
5. Look for indirect signals
Only after completing this full analysis should you begin scoring.

Cost: ~$0.15 per candidate
Speed: ~1.2-1.4 req/s at 32 concurrency
Total time: ~25 min for 5 runs × 359 candidates
Results saved: experiments/results/experiment-b-reasoning-deep-prompt/

Experiment C: grok-4.20-0309-non-reasoning

Model: grok-4.20-0309-non-reasoning (xAI's newer, more capable non-reasoning model)
Provider: xAI
API: Same endpoint, same schema, same prompt as Baseline
Cost: ~$0.011 per candidate ($0.20/M input, $0.50/M output — same as all xAI models)
Speed: ~1.0-1.3 req/s at 32 concurrency
Total time: ~25 min for 5 runs × 359 candidates
Results saved: experiments/results/experiment-c-grok420-non-reasoning/

Experiment D: Claude Sonnet 4.6

Model: claude-sonnet-4-6 (Anthropic)
Provider: Anthropic
API: v1/messages — direct JSON output (no tool_use, to avoid overhead)
Cost: ~$0.15 per candidate ($3/M input, $15/M output)
Speed: ~0.05-0.1 req/s at 32 concurrency (very slow — high per-request latency on 160K prompts)
Total time: ~6 hours for 5 runs × 359 candidates
Results saved: experiments/results/experiment-d-sonnet46/

3. Results

Main comparison: % unstable candidates per factor (spread ≥ 2 across 5 runs)

Factor	4.1-fast non-reas	4.1-fast reasoning	4.1-fast reas+deep	grok-4.20 non-reas	Sonnet 4.6
International Revenue Growth	3.9% *	6.7%	7.0%	26.2%	9.5%
Industry & Market Relevance	8.9% *	15.6%	16.2%	17.5%	16.2%
Seniority & Scope	1.4%	0.6%	0.3% *	0.8%	9.5%
CEO Readiness & Subsidiary Leadership	5.8%	4.2%	5.3%	2.5% *	10.3%
SaaS & Subscription Leadership	1.1% *	2.8%	2.2%	8.4%	9.5%
Architecture Domain Affinity	10.9%	14.8%	10.3% *	14.5%	13.4%

Bold * = best (lowest instability) for that factor.

Cost and speed comparison

All xAI models share the same pricing: $0.20/M input, $0.50/M output ≈ $0.011/candidate.

Model	Cost/candidate	Throughput (32 conc.)	Time for 5×359	Factors won
grok-4-1-fast-non-reasoning	$0.011	~4-5 req/s	~8 min	3
grok-4-1-fast-reasoning	$0.011	~0.7-1.1 req/s	~30 min	0
grok-4-1-fast-reasoning+deep	$0.011	~1.2-1.4 req/s	~25 min	2
grok-4.20-0309-non-reasoning	$0.011	~1.0-1.3 req/s	~25 min	1
claude-sonnet-4-6	$0.170	~0.05-0.1 req/s	~6 hours	0

Total experiment cost

xAI pricing: $0.20/M input, $0.50/M output (same for all grok models, per ProjectCostService.php). Anthropic Sonnet 4.6 pricing: $3/M input, $15/M output. Average prompt: ~49K input tokens, ~1.5K output tokens per call.

Experiment	Runs × Candidates	Cost/call	Est. Cost
Baseline (4.1-fast-non-reas)	5 × 359	$0.011	~$19
Exp A (4.1-fast-reasoning)	5 × 359	$0.011	~$19
Exp B (4.1-fast-reas+deep)	5 × 359	$0.011	~$19
Exp C (grok-4.20-non-reas)	5 × 359	$0.011	~$19
Exp D (Sonnet 4.6)	5 × 359	$0.170	~$306
Total experiment	8,975 calls		~$382

4. Analysis

Finding 1: The simplest, fastest, cheapest model is the most stable

grok-4-1-fast-non-reasoning wins on 3 of 6 factors and is competitive on the other 3. It is:

22x cheaper than reasoning models and Sonnet ($0.007 vs $0.15)
50-100x faster than Sonnet 4.6
3-5x faster than other xAI models

This is counter-intuitive. You would expect a more capable model or a reasoning model to be more consistent. The opposite is true.

Finding 2: Reasoning does not improve stability

Comparing grok-4-1-fast-non-reasoning vs grok-4-1-fast-reasoning:

Industry & Market Relevance: 8.9% → 15.6% (nearly doubled, worse)
Architecture Domain Affinity: 10.9% → 14.8% (worse)
International Revenue Growth: 3.9% → 6.7% (worse)
CEO Readiness: 5.8% → 4.2% (slightly better)

Reasoning adds ~3x latency, 22x cost, and makes most factors less stable.

Finding 3: Deep analysis prompt does not help

Comparing grok-4-1-fast-reasoning vs grok-4-1-fast-reasoning+deep:

Numbers are within ±2 percentage points on every factor
No systematic improvement
The prompt instruction to "read the entire profile" had no measurable effect

Finding 4: A more capable model (grok-4.20) is significantly less stable

grok-4.20-0309-non-reasoning is dramatically worse on key factors:

International Revenue Growth: 3.9% → 26.2% (6.7x worse than baseline)
SaaS & Subscription: 1.1% → 8.4% (7.6x worse)
Industry & Market: 8.9% → 17.5% (2x worse)

Only CEO Readiness improved (5.8% → 2.5%).

Finding 5: Anthropic Sonnet 4.6 is the worst performer

Sonnet 4.6 is unstable on every factor — 9.5% to 16.2%. It wins on zero factors. Specific results:

Seniority & Scope: 9.5% vs 1.4% baseline (6.8x worse — a factor that all xAI models handle well)
CEO Readiness: 10.3% vs 5.8% baseline (1.8x worse)
SaaS & Subscription: 9.5% vs 1.1% baseline (8.6x worse)

Additionally, Sonnet 4.6 is extremely slow on 160K char prompts: ~0.05-0.1 req/s effective throughput, making a 5-run experiment take ~6 hours vs ~8 minutes for baseline.

Finding 6: Two factors are inherently unstable regardless of model

Across all 5 experiments, these two factors consistently show the highest instability:

Industry & Market Relevance: 8.9% - 17.5% unstable (never below 8.9%)
Architecture Domain Affinity: 10.3% - 14.5% unstable (never below 10.3%)

These factors require subjective interpretation of career history relevance to the AEC/architecture domain. The model's assessment genuinely varies because the evidence is ambiguous for ~10-15% of candidates. No model or prompt change fixes this.

Finding 7: Four factors are reliably stable on baseline model

On grok-4-1-fast-non-reasoning:

Seniority & Scope: 1.4% unstable
SaaS & Subscription Leadership: 1.1% unstable
International Revenue Growth: 3.9% unstable
CEO Readiness: 5.8% unstable

These numbers degrade significantly on other models, proving these factors are well-written — the instability on other models is a model problem, not a factor problem.

5. Hypotheses for why this happens

Why is the fast non-reasoning model the most stable?

Hypothesis: Less "thinking" = less variance on borderline cases.

A non-reasoning model maps input → output more deterministically. It pattern-matches against the scoring rubric without deliberation. When a candidate's profile is borderline on a factor (e.g., "some AEC exposure but indirect"), the fast model consistently picks the same bucket.

A reasoning model deliberates. On each run, the chain of thought may explore different aspects of the profile, reach different intermediate conclusions, and therefore arrive at different scores. More reasoning = more paths = more variance.

Why is grok-4.20 the worst among xAI models?

Hypothesis: Larger models have wider output distributions.

grok-4.20 is a more capable model — it sees more nuance, considers more angles, and has a richer internal representation. This is great for open-ended tasks but harmful for consistency on structured scoring. When the model "understands more," it also "second-guesses more."

The International Revenue Growth factor is the clearest example: it requires specific numerical evidence (25%+ growth, €30M+ revenue, cold market entry). grok-4.20 apparently interprets "cold market entry" more broadly on some runs than others, leading to 26.2% instability.

Why is Sonnet 4.6 the worst overall?

Hypothesis: Different architecture + no native json_schema enforcement.

Two factors likely contribute:

No json_schema mode. xAI models use response_format: json_schema which constrains output to the exact schema. Sonnet receives a text prompt asking for JSON — more degrees of freedom in how it structures the response, potentially affecting how it allocates attention to different factors.
Different training distribution. Sonnet is optimized for general-purpose tasks. xAI's grok models may be better calibrated for structured scoring due to different fine-tuning priorities.

Why doesn't the deep analysis prompt help?

Hypothesis: The model already reads the profile; the problem is interpretation, not reading.

The "lazy zero" pattern from earlier experiments (model says "insufficient information" when data is present) was largely fixed by the Score 0 redefinition. The remaining instability is not about missing data — it's about how the model weighs ambiguous evidence. Telling it to "read more carefully" doesn't change how it interprets what it finds.

6. Conclusions

Stick with grok-4-1-fast-non-reasoning for production scoring. It is the fastest and most stable model tested. All xAI models cost the same ($0.011/candidate), so speed is the differentiator — and fast-non-reasoning is 3-5x faster.
Do not switch to reasoning models for scoring. Same cost, 3-5x slower, and less consistent. Reasoning helps on open-ended tasks, not on structured scoring with clear rubrics.
Do not switch to grok-4.20. Despite being "smarter," it is dramatically less stable on 4 of 6 factors. The International Revenue Growth factor goes from 3.9% to 26.2% unstable.
Do not use Anthropic models for scoring. Sonnet 4.6 is the worst performer on every metric: slowest (50-100x), 15x more expensive ($0.17 vs $0.011), and least stable (wins zero factors). This may change with future model versions or native json_schema support.
Two factors (Industry & Market, Architecture Domain) will always have ~10% unstable candidates. This is inherent to the ambiguity of matching career history to domain expertise. No model or prompt fixes this. Options:
- Accept it and flag borderline candidates for human review
- Run scoring 3 times and take the median (eliminates outliers, 3x cost — still only $0.02/candidate)
- Rework factor criteria to reduce subjective interpretation (may reduce scoring quality)
Prompt engineering has minimal impact on stability. The "deep analysis" instruction and Score 0 redefinition did not produce meaningful improvements once measured with the correct metric (% unstable candidates).
The correct metric is % unstable candidates, not "flipper rate" or "mean spread". Mean spread masks individual outliers. Flipper rate (0 vs non-0) is too narrow. Percentage of candidates with spread ≥ 40% of the scale captures what actually matters: how many candidates get unreliable scores.

7. Files

Path	Description
`results/experiment-baseline-41fast-nonreasoning/`	Baseline: 5 runs, grok-4-1-fast-non-reasoning
`results/experiment-a-reasoning-only/`	Exp A: 5 runs, grok-4-1-fast-reasoning
`results/experiment-b-reasoning-deep-prompt/`	Exp B: 5 runs, grok-4-1-fast-reasoning + deep prompt
`results/experiment-c-grok420-non-reasoning/`	Exp C: 5 runs, grok-4.20-0309-non-reasoning
`results/experiment-d-sonnet46/`	Exp D: 5 runs, claude-sonnet-4-6
`results/experiment-report-iteration-1.md`	Earlier report on Score 0 redefinition
`data/project-context.json`	Current factor definitions (with ZERO GUARD)
`data/project-context-original.json`	Original factor definitions (backup)

iamarsenibragimov/experiment-report-model-comparison.md

Select an option

No results found