Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Select an option

  • Save iamarsenibragimov/c34cde5d5f170de721ef997458cd89db to your computer and use it in GitHub Desktop.

Select an option

Save iamarsenibragimov/c34cde5d5f170de721ef997458cd89db to your computer and use it in GitHub Desktop.
Scoring Variance Experiment — Model Comparison Report (5 models, 359 candidates, 5 runs each)

Scoring Variance Experiment — Model Comparison Report

Project: Learco Personal, Project 1409 (Graphisoft CEO search) Date: 2026-04-04 Candidates: 359 with full LinkedIn profiles Scoring factors: 6, all on 0-5 scale (total max: 30) Runs per experiment: 5 Concurrency: 32 parallel requests (10 for Anthropic due to latency) Prompt: Identical across all experiments (exported via experiment:export-prompts using production CandidateScoringPromptService) Dealroom enrichment: Skipped (to eliminate external API nondeterminism)


1. Experiment Setup

What we measured

For each candidate × each factor × each run, the model returns a score (0-5) and reasoning. We define a candidate as unstable on a factor if max(scores) - min(scores) >= 2 across 5 runs. On a 0-5 scale, spread of 2 = 40% of the scale — this is not ±1 noise, it's a meaningful disagreement.

Scoring factors tested

ID Factor Max Type
5968 International Revenue Growth Track Record 5 CRITICAL
5969 Industry & Market Relevance 5 CRITICAL
5970 Seniority & Scope 5 SUPPORTING
5971 CEO Readiness & Subsidiary Leadership 5 SUPPORTING
5972 SaaS & Subscription Leadership 5 SUPPORTING
5973 Architecture Domain Affinity 5 SUPPORTING

Prompt structure

All experiments used the same prompt per candidate, built by PHP CandidateScoringPromptService::generateScoringPrompt(). Prompt includes:

  • System instructions (role, JSON format, scoring rules)
  • Search specification document (plain text, HTML stripped)
  • Custom researcher note prompt (advanced format with company info)
  • All 6 factor definitions with rubrics, edge case rules, and ZERO GUARD instruction
  • Target company data (all data points from company_data table)
  • Full candidate LinkedIn profile JSON (~50-150K chars)
  • Accuracy check instruction referencing all data point names
  • Today's date

Average prompt length: ~160,000 characters per candidate.

What varied between experiments

Variable Baseline Exp A Exp B Exp C Exp D
Model grok-4-1-fast-non-reasoning grok-4-1-fast-reasoning grok-4-1-fast-reasoning grok-4.20-0309-non-reasoning claude-sonnet-4-6
Provider xAI xAI xAI xAI Anthropic
Prompt change + deep analysis instruction
Output format json_schema json_schema json_schema json_schema direct JSON (no tool_use)

2. Experiments Conducted

Baseline: grok-4-1-fast-non-reasoning

  • Model: grok-4-1-fast-non-reasoning (current production scoring model)
  • Provider: xAI
  • API: v1/chat/completions with response_format: json_schema
  • Cost: ~$0.011 per candidate ($0.20/M input, $0.50/M output)
  • Speed: ~4-5 req/s at 32 concurrency
  • Total time: ~8 min for 5 runs × 359 candidates
  • Results saved: experiments/results/experiment-baseline-41fast-nonreasoning/

Experiment A: grok-4-1-fast-reasoning

  • Model: grok-4-1-fast-reasoning (xAI reasoning model)
  • Provider: xAI
  • API: v1/chat/completions with response_format: json_schema
  • Cost: ~$0.011 per candidate ($0.20/M input, $0.50/M output)
  • Speed: ~0.7-1.1 req/s at 32 concurrency
  • Total time: ~30 min for 5 runs × 359 candidates
  • Results saved: experiments/results/experiment-a-reasoning-only/

Experiment B: grok-4-1-fast-reasoning + deep analysis prompt

  • Model: grok-4-1-fast-reasoning (same as Exp A)
  • Provider: xAI
  • Prompt modification: Added "Deep Profile Analysis Required" instruction after # Instructions:
    ## CRITICAL: Deep Profile Analysis Required
    Before scoring ANY factor, you MUST perform a complete, thorough analysis
    of the ENTIRE candidate profile below. This means:
    1. Read EVERY work experience entry — do not skip any, even if the profile is long
    2. Read ALL education entries, certifications, skills, and summary sections
    3. For each factor, trace evidence across the FULL career history
    4. Cross-reference job titles with company descriptions to understand actual scope
    5. Look for indirect signals
    Only after completing this full analysis should you begin scoring.
    
  • Cost: ~$0.15 per candidate
  • Speed: ~1.2-1.4 req/s at 32 concurrency
  • Total time: ~25 min for 5 runs × 359 candidates
  • Results saved: experiments/results/experiment-b-reasoning-deep-prompt/

Experiment C: grok-4.20-0309-non-reasoning

  • Model: grok-4.20-0309-non-reasoning (xAI's newer, more capable non-reasoning model)
  • Provider: xAI
  • API: Same endpoint, same schema, same prompt as Baseline
  • Cost: ~$0.011 per candidate ($0.20/M input, $0.50/M output — same as all xAI models)
  • Speed: ~1.0-1.3 req/s at 32 concurrency
  • Total time: ~25 min for 5 runs × 359 candidates
  • Results saved: experiments/results/experiment-c-grok420-non-reasoning/

Experiment D: Claude Sonnet 4.6

  • Model: claude-sonnet-4-6 (Anthropic)
  • Provider: Anthropic
  • API: v1/messages — direct JSON output (no tool_use, to avoid overhead)
  • Cost: ~$0.15 per candidate ($3/M input, $15/M output)
  • Speed: ~0.05-0.1 req/s at 32 concurrency (very slow — high per-request latency on 160K prompts)
  • Total time: ~6 hours for 5 runs × 359 candidates
  • Results saved: experiments/results/experiment-d-sonnet46/

3. Results

Main comparison: % unstable candidates per factor (spread ≥ 2 across 5 runs)

Factor 4.1-fast non-reas 4.1-fast reasoning 4.1-fast reas+deep grok-4.20 non-reas Sonnet 4.6
International Revenue Growth 3.9% * 6.7% 7.0% 26.2% 9.5%
Industry & Market Relevance 8.9% * 15.6% 16.2% 17.5% 16.2%
Seniority & Scope 1.4% 0.6% 0.3% * 0.8% 9.5%
CEO Readiness & Subsidiary Leadership 5.8% 4.2% 5.3% 2.5% * 10.3%
SaaS & Subscription Leadership 1.1% * 2.8% 2.2% 8.4% 9.5%
Architecture Domain Affinity 10.9% 14.8% 10.3% * 14.5% 13.4%

Bold * = best (lowest instability) for that factor.

Cost and speed comparison

All xAI models share the same pricing: $0.20/M input, $0.50/M output ≈ $0.011/candidate.

Model Cost/candidate Throughput (32 conc.) Time for 5×359 Factors won
grok-4-1-fast-non-reasoning $0.011 ~4-5 req/s ~8 min 3
grok-4-1-fast-reasoning $0.011 ~0.7-1.1 req/s ~30 min 0
grok-4-1-fast-reasoning+deep $0.011 ~1.2-1.4 req/s ~25 min 2
grok-4.20-0309-non-reasoning $0.011 ~1.0-1.3 req/s ~25 min 1
claude-sonnet-4-6 $0.170 ~0.05-0.1 req/s ~6 hours 0

Total experiment cost

xAI pricing: $0.20/M input, $0.50/M output (same for all grok models, per ProjectCostService.php). Anthropic Sonnet 4.6 pricing: $3/M input, $15/M output. Average prompt: ~49K input tokens, ~1.5K output tokens per call.

Experiment Runs × Candidates Cost/call Est. Cost
Baseline (4.1-fast-non-reas) 5 × 359 $0.011 ~$19
Exp A (4.1-fast-reasoning) 5 × 359 $0.011 ~$19
Exp B (4.1-fast-reas+deep) 5 × 359 $0.011 ~$19
Exp C (grok-4.20-non-reas) 5 × 359 $0.011 ~$19
Exp D (Sonnet 4.6) 5 × 359 $0.170 ~$306
Total experiment 8,975 calls ~$382

4. Analysis

Finding 1: The simplest, fastest, cheapest model is the most stable

grok-4-1-fast-non-reasoning wins on 3 of 6 factors and is competitive on the other 3. It is:

  • 22x cheaper than reasoning models and Sonnet ($0.007 vs $0.15)
  • 50-100x faster than Sonnet 4.6
  • 3-5x faster than other xAI models

This is counter-intuitive. You would expect a more capable model or a reasoning model to be more consistent. The opposite is true.

Finding 2: Reasoning does not improve stability

Comparing grok-4-1-fast-non-reasoning vs grok-4-1-fast-reasoning:

  • Industry & Market Relevance: 8.9% → 15.6% (nearly doubled, worse)
  • Architecture Domain Affinity: 10.9% → 14.8% (worse)
  • International Revenue Growth: 3.9% → 6.7% (worse)
  • CEO Readiness: 5.8% → 4.2% (slightly better)

Reasoning adds ~3x latency, 22x cost, and makes most factors less stable.

Finding 3: Deep analysis prompt does not help

Comparing grok-4-1-fast-reasoning vs grok-4-1-fast-reasoning+deep:

  • Numbers are within ±2 percentage points on every factor
  • No systematic improvement
  • The prompt instruction to "read the entire profile" had no measurable effect

Finding 4: A more capable model (grok-4.20) is significantly less stable

grok-4.20-0309-non-reasoning is dramatically worse on key factors:

  • International Revenue Growth: 3.9% → 26.2% (6.7x worse than baseline)
  • SaaS & Subscription: 1.1% → 8.4% (7.6x worse)
  • Industry & Market: 8.9% → 17.5% (2x worse)

Only CEO Readiness improved (5.8% → 2.5%).

Finding 5: Anthropic Sonnet 4.6 is the worst performer

Sonnet 4.6 is unstable on every factor — 9.5% to 16.2%. It wins on zero factors. Specific results:

  • Seniority & Scope: 9.5% vs 1.4% baseline (6.8x worse — a factor that all xAI models handle well)
  • CEO Readiness: 10.3% vs 5.8% baseline (1.8x worse)
  • SaaS & Subscription: 9.5% vs 1.1% baseline (8.6x worse)

Additionally, Sonnet 4.6 is extremely slow on 160K char prompts: ~0.05-0.1 req/s effective throughput, making a 5-run experiment take ~6 hours vs ~8 minutes for baseline.

Finding 6: Two factors are inherently unstable regardless of model

Across all 5 experiments, these two factors consistently show the highest instability:

  • Industry & Market Relevance: 8.9% - 17.5% unstable (never below 8.9%)
  • Architecture Domain Affinity: 10.3% - 14.5% unstable (never below 10.3%)

These factors require subjective interpretation of career history relevance to the AEC/architecture domain. The model's assessment genuinely varies because the evidence is ambiguous for ~10-15% of candidates. No model or prompt change fixes this.

Finding 7: Four factors are reliably stable on baseline model

On grok-4-1-fast-non-reasoning:

  • Seniority & Scope: 1.4% unstable
  • SaaS & Subscription Leadership: 1.1% unstable
  • International Revenue Growth: 3.9% unstable
  • CEO Readiness: 5.8% unstable

These numbers degrade significantly on other models, proving these factors are well-written — the instability on other models is a model problem, not a factor problem.


5. Hypotheses for why this happens

Why is the fast non-reasoning model the most stable?

Hypothesis: Less "thinking" = less variance on borderline cases.

A non-reasoning model maps input → output more deterministically. It pattern-matches against the scoring rubric without deliberation. When a candidate's profile is borderline on a factor (e.g., "some AEC exposure but indirect"), the fast model consistently picks the same bucket.

A reasoning model deliberates. On each run, the chain of thought may explore different aspects of the profile, reach different intermediate conclusions, and therefore arrive at different scores. More reasoning = more paths = more variance.

Why is grok-4.20 the worst among xAI models?

Hypothesis: Larger models have wider output distributions.

grok-4.20 is a more capable model — it sees more nuance, considers more angles, and has a richer internal representation. This is great for open-ended tasks but harmful for consistency on structured scoring. When the model "understands more," it also "second-guesses more."

The International Revenue Growth factor is the clearest example: it requires specific numerical evidence (25%+ growth, €30M+ revenue, cold market entry). grok-4.20 apparently interprets "cold market entry" more broadly on some runs than others, leading to 26.2% instability.

Why is Sonnet 4.6 the worst overall?

Hypothesis: Different architecture + no native json_schema enforcement.

Two factors likely contribute:

  1. No json_schema mode. xAI models use response_format: json_schema which constrains output to the exact schema. Sonnet receives a text prompt asking for JSON — more degrees of freedom in how it structures the response, potentially affecting how it allocates attention to different factors.
  2. Different training distribution. Sonnet is optimized for general-purpose tasks. xAI's grok models may be better calibrated for structured scoring due to different fine-tuning priorities.

Why doesn't the deep analysis prompt help?

Hypothesis: The model already reads the profile; the problem is interpretation, not reading.

The "lazy zero" pattern from earlier experiments (model says "insufficient information" when data is present) was largely fixed by the Score 0 redefinition. The remaining instability is not about missing data — it's about how the model weighs ambiguous evidence. Telling it to "read more carefully" doesn't change how it interprets what it finds.


6. Conclusions

  1. Stick with grok-4-1-fast-non-reasoning for production scoring. It is the fastest and most stable model tested. All xAI models cost the same ($0.011/candidate), so speed is the differentiator — and fast-non-reasoning is 3-5x faster.

  2. Do not switch to reasoning models for scoring. Same cost, 3-5x slower, and less consistent. Reasoning helps on open-ended tasks, not on structured scoring with clear rubrics.

  3. Do not switch to grok-4.20. Despite being "smarter," it is dramatically less stable on 4 of 6 factors. The International Revenue Growth factor goes from 3.9% to 26.2% unstable.

  4. Do not use Anthropic models for scoring. Sonnet 4.6 is the worst performer on every metric: slowest (50-100x), 15x more expensive ($0.17 vs $0.011), and least stable (wins zero factors). This may change with future model versions or native json_schema support.

  5. Two factors (Industry & Market, Architecture Domain) will always have ~10% unstable candidates. This is inherent to the ambiguity of matching career history to domain expertise. No model or prompt fixes this. Options:

    • Accept it and flag borderline candidates for human review
    • Run scoring 3 times and take the median (eliminates outliers, 3x cost — still only $0.02/candidate)
    • Rework factor criteria to reduce subjective interpretation (may reduce scoring quality)
  6. Prompt engineering has minimal impact on stability. The "deep analysis" instruction and Score 0 redefinition did not produce meaningful improvements once measured with the correct metric (% unstable candidates).

  7. The correct metric is % unstable candidates, not "flipper rate" or "mean spread". Mean spread masks individual outliers. Flipper rate (0 vs non-0) is too narrow. Percentage of candidates with spread ≥ 40% of the scale captures what actually matters: how many candidates get unreliable scores.


7. Files

Path Description
results/experiment-baseline-41fast-nonreasoning/ Baseline: 5 runs, grok-4-1-fast-non-reasoning
results/experiment-a-reasoning-only/ Exp A: 5 runs, grok-4-1-fast-reasoning
results/experiment-b-reasoning-deep-prompt/ Exp B: 5 runs, grok-4-1-fast-reasoning + deep prompt
results/experiment-c-grok420-non-reasoning/ Exp C: 5 runs, grok-4.20-0309-non-reasoning
results/experiment-d-sonnet46/ Exp D: 5 runs, claude-sonnet-4-6
results/experiment-report-iteration-1.md Earlier report on Score 0 redefinition
data/project-context.json Current factor definitions (with ZERO GUARD)
data/project-context-original.json Original factor definitions (backup)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment