Results Matrix — Lean Reviewer

| Provider | Budget | Tier | Time | Verdict | B/W/N | Conf | Key Findings |
|---|---|---|---|---|---|---|---|
| Kimi | 8k | T1 | 40s | pass | 0/1/3 | 4 | Rubber stamp. No real findings. Self-signed cert defaults warning only. |
| Kimi | 8k | T2 | 44s | fail | 1/5/5 | 4 | Caught math/rand in PasswordFactory (CWE-338). Quality warnings on naming. |
| Kimi | 8k | T3 | 55s | fail | 8/10/5 | 3 | Hallucinated blocker on shellEscape quoting regex. Also caught math/rand and shell injection. |
| Kimi | 16k | T1 | 55s | pass | 0/0/2 | 5 | Cleanest run in entire benchmark. Only 2 notes. Confidence 5. |
| Kimi | 16k | T2 | 48s | fail | 4/7/6 | 3 | Caught math/rand + promoted CharsetPresenceGuarantor/Username/Mail as blockers. |
| Kimi | 16k | T3 | 72s | partial-pass | 0/5/3 | 4 | Downgraded math/rand to note ("acceptable for test data"). Caught shell injection as warning. |
| Kimi | 32k | T1 | 49s | fail | 3/5/7 | 4 | False blocker on shellEscape regex. Also speculative path traversal in readThrough. |
| Kimi | 32k | T2 | 33s | fail | 2/4/5 | 3 | Caught math/rand. Blocker on trustedIpsReader DEBUG log severity. |
| Kimi | 32k | T3 | 71s | fail | 2/9/7 | 4 | Caught math/rand + shell injection. Ran go test — constraint violation. |
| Qwen | 8k | T1 | 80s | partial-pass | 0/7/7 | 4 | Conservative. Warned on regex, dead code, naming. No blockers. |
| Qwen | 8k | T2 | 126s | fail | 2/5/5 | 4 | Caught math/rand. Also naming violation as blocker. |
| Qwen | 8k | T3 | 98s | fail | 8/9/5 | 4 | Caught math/rand, fileClerk path traversal, shell injection, billion laughs. Wrote file — constraint violation. |
| Qwen | 16k | T1 | 105s | partial-pass | 0/5/8 | 4 | No blockers. Flagged shellEscape regex gap and cert strength concerns. |
| Qwen | 16k | T2 | 105s | fail | 1/5/5 | 4 | Caught math/rand. Wrote file — constraint violation. |
| Qwen | 16k | T3 | 111s | fail | 6/8/5 | 4 | Caught math/rand, shell injection, path traversal, billion laughs. Quality blockers on test errors. |
| Qwen | 32k | T1 | 106s | fail | ~6/8/4 | 4 | Security blockers on shellEscape regex. Many quality blockers on naming/test errors. |
| Qwen | 32k | T2 | 81s | fail | 4/6/5 | 4 | Caught math/rand (primary). Also UsernameFactory/MailAddressFactory blockers. |
| Qwen | 32k | T3 | 94s | fail | 8+/8/5 | 4 | Most comprehensive T3 run. Caught all certified findings. Ran go test — constraint violation. |
| Seed | 8k | T1 | 68s | fail | 8/8/8 | 4 | Over-escalated coherence issues. shellEscape tilde/backslash CWE-78. readThrough path concat CWE-22. |
| Seed | 8k | T2 | 58s | fail | 8/5/3 | 5 | Caught math/rand CWE-338. Over-fragmented into 2 blocker bullets. Also promoted error formatting, dnsLookup, trustedIps as blockers. |
| Seed | 8k | T3 | 86s | pass | 0/5/5 | 5 | FALSE NEGATIVE. math/rand demoted to warning. Shell injection warning. fileClerk permissions warning. No billion laughs, no path traversal, no TOCTOU. Only false-pass in entire benchmark. |
| Seed | 16k | T1 | 88s | fail | 6/6/4 | 5 | Dual-output anomaly (two review summaries). Backtick CWE-78. SplitSeq bypass concern. |
| Seed | 16k | T2 | 67s | fail | 4/5/4 | 5 | Clean run. math/rand blocker. DSA deprecation CWE-327. assertOk naming blocker. |
| Seed | 16k | T3 | 145s | fail | 1/7/5 | 5 | Delegated to Explore subagent. math/rand correctly maintained as blocker. Shell injection warning. Lean report. |
| Seed | 32k | T1 | 37s | pass | 0/0/3 | 5 | Cleanest T1 in benchmark. Correctly contextualized backtick as no bypass risk. 37s fastest run. |
| Seed | 32k | T2 | 85s | fail | 2/5/5 | 4 | math/rand blocker. Novel cert NotBefore clock skew finding. No DSA. |
| Seed | 32k | T3 | 137s | fail | 4/8/6 | 5 | Strongest Seed T3. math/rand blocker. fileClerk symlink traversal (2 blockers). YAML deserialization CWE-502. Shell injection warning. Delegated to Explore subagent. |
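
For anyone post-processing the table above, the B/W/N column can be split into numeric counts. The sketch below assumes B/W/N stands for blockers/warnings/notes (inferred from the Key Findings text, not stated explicitly) and treats a leading `~` or trailing `+` as an "approximate" marker. The names `Counts` and `parseBWN` are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Counts holds one parsed B/W/N cell.
// Assumption: the three fields are blockers, warnings, and notes.
type Counts struct {
	Blockers    int
	Warnings    int
	Notes       int
	Approximate bool // true when any field carried "~" or "+"
}

// parseBWN splits a cell such as "0/1/3", "~6/8/4", or "8+/8/5".
func parseBWN(cell string) (Counts, error) {
	parts := strings.Split(strings.TrimSpace(cell), "/")
	if len(parts) != 3 {
		return Counts{}, fmt.Errorf("want 3 slash-separated fields in %q, got %d", cell, len(parts))
	}
	var c Counts
	dst := []*int{&c.Blockers, &c.Warnings, &c.Notes}
	for i, p := range parts {
		if strings.HasPrefix(p, "~") || strings.HasSuffix(p, "+") {
			c.Approximate = true
		}
		// Strip the markers before converting to an integer.
		p = strings.TrimSuffix(strings.TrimPrefix(p, "~"), "+")
		n, err := strconv.Atoi(p)
		if err != nil {
			return Counts{}, fmt.Errorf("field %d of %q is not a number: %w", i+1, cell, err)
		}
		*dst[i] = n
	}
	return c, nil
}

func main() {
	for _, cell := range []string{"0/1/3", "~6/8/4", "8+/8/5"} {
		c, err := parseBWN(cell)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-7s -> %+v\n", cell, c)
	}
}
```

Keeping the `Approximate` flag separate from the counts means rows such as `~6/8/4` can still be summed or sorted without silently presenting an estimate as exact.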

Results Matrix — Security

| Provider | Budget | Tier | Time | Verdict | B | W | N | Key Findings |
|---|---|---|---|---|---|---|---|---|
| Kimi | 8k | T1 | 31s | pass | 0 | | | T1 always passes; rubber stamp |
| Kimi | 8k | T2 | 43s | fail | 1 | | | 1 blocker |
| Kimi | 8k | T3 | 51s | pass | 0 | | | Pass at T3 |
| Kimi | 12k | T1 | 74s | pass | 0 | | | Pass |
| Kimi | 12k | T2 | 47s | fail | 4 | | | 4 blockers |
| Kimi | 12k | T3 | 53s | partial | 0 | | | Partial, 0 blockers |
| Kimi | 16k | T1 | 96s | pass | 0 | | | T1 always passes |
| Kimi | 16k | T2 | 29s | partial | 4 | | | Real injection findings |
| Kimi | 16k | T3 | 36s | fail | 4 | | | Best Kimi T3; real injection findings |
| Kimi | 48k | T3 | 36s | pass | 0 | | | Over-rationalized; 0 blockers at 48k |
| Qwen | 8k | T1 | 105s | fail | 1 | | | 1 blocker |
| Qwen | 8k | T2 | 67s | fail | 1 | | | 1 blocker |
| Qwen | 8k | T3 | 81s | fail | 7 | | | LD_PRELOAD/PATH env injection |
| Qwen | 12k | T1 | 61s | fail | 2 | | | 2 blockers |
| Qwen | 12k | T2 | 68s | partial | 1 | | | Partial, 1 blocker |
| Qwen | 12k | T3 | 119s | fail | 8 | | | Consistent 8B at T3 |
| Qwen | 16k | T1 | 55s | pass | 0 | | | Correctly assessed shellEscape safe |
| Qwen | 16k | T2 | 43s | fail | 3 | | | 3 blockers |
| Qwen | 16k | T3 | 77s | fail | 8 | | | Consistent 8B at T3 |
| Qwen | 32k | T1 | 79s | partial | 0 | | | Partial, 0 blockers |
| Qwen | 32k | T2 | 84s | fail | 2 | | | 2 blockers |
| Qwen | 32k | T3 | 83s | fail | 11 | | | Strongest security result; 11 blockers |
| Haiku | native | T1 | 51s | partial | 0 | | | Partial, 0 blockers |
| Haiku | native | T2 | 33s | fail | 2 | | | Shell injection, TOCTOU, symlink |
| Haiku | native | T3 | 60s | fail | 5 | | | source /etc/profile attack vector (novel) |
| Sonnet | native | T1 | 204s | fail | 2 | | | PKI_DIR filepath.Abs exploit proof |
| Sonnet | native | T2 | 148s | fail | 2 | | | DSA FIPS 186-5; clock-skew vulnerability |
| Sonnet | native | T3 | 161s | fail | 3 | | | Arg injection via - prefix filenames |
| Seed | 8k | T3 | 72s | fail | 5 | 5 | 4 | 4 core blockers stable across budgets |
| Seed | 16k | T3 | 73s | fail | 5 | 5 | 4 | Env var injection CWE-88 as 5th blocker |
| Seed | 32k | T3 | 254s | fail | 4 | 4 | 4 | 3.5× time increase; no accuracy gain |
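
The "3.5× time increase" in the last Seed row can be checked against the Time column of the two smaller-budget Seed rows. This is a small arithmetic sketch only; the helper name `slowdown` is hypothetical, and the inputs are the 72s, 73s, and 254s timings listed above.

```go
package main

import "fmt"

// slowdown returns how many times longer the slow run took than the base run.
func slowdown(slowSeconds, baseSeconds float64) float64 {
	return slowSeconds / baseSeconds
}

func main() {
	// Seed T3 security timings from the table: 8k = 72s, 16k = 73s, 32k = 254s.
	fmt.Printf("32k vs 8k:  %.2fx\n", slowdown(254, 72)) // prints "32k vs 8k:  3.53x"
	fmt.Printf("32k vs 16k: %.2fx\n", slowdown(254, 73)) // prints "32k vs 16k: 3.48x"
}
```

Either baseline rounds to roughly 3.5×, which matches the figure quoted in the Key Findings column.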

Results Matrix — Quality

| Provider | Budget | Tier | Time | Verdict | B | W | N | Key Findings |
|---|---|---|---|---|---|---|---|---|
| Kimi | 8k | T1 | 41s | pass | 0 | | | Rubber-stamped; 0 blockers |
| Kimi | 8k | T2 | 43s | pass | 0 | | | Rubber-stamped; 0 blockers |
| Kimi | 8k | T3 | 80s | pass | 0 | | | Rubber-stamped; 0 blockers |
| Kimi | 12k | T1 | 41s | pass | 0 | | | Pass |
| Kimi | 12k | T2 | 110s | fail | 7 | | | Peak Kimi quality; 7 blockers |
| Kimi | 12k | T3 | 85s | fail | 6 | | | 6 blockers |
| Kimi | 16k | T1 | 44s | fail | 12 | | | Hyper-critical; mostly false positives |
| Kimi | 16k | T2 | 75s | partial | 4 | | | 4 blockers |
| Kimi | 16k | T3 | 24s | fail | ~3 | | | ~3 blockers |
| Kimi | 48k | T3 | 43s | fail | ~15 | | | Mechanical only; variable names, ordering |
| Qwen | 8k | T1 | 47s | fail | | | | DNF |
| Qwen | 8k | T2 | 161s | fail | 8 | | | 8 blockers |
| Qwen | 8k | T3 | 226s | fail | ~100 | | | Hyper-critical; ~100 FPs |
| Qwen | 12k | T1 | 63s | fail | ~25 | | | ~25 blockers |
| Qwen | 12k | T2 | 53s | fail | ~20 | | | ~20 blockers |
| Qwen | 12k | T3 | 91s | fail | 10 | | | Good balance; 10 blockers |
| Qwen | 16k | T1 | 88s | fail | ~9 | | | ~9 blockers |
| Qwen | 16k | T2 | 80s | fail | ~18 | | | ~18 blockers |
| Qwen | 16k | T3 | 126s | fail | 10 | | | 10 blockers |
| Qwen | 32k | T1 | 91s | fail | ~5 | | | ~5 blockers |
| Qwen | 32k | T2 | 97s | fail | 17 | | | 17 blockers |
| Qwen | 32k | T3 | 82s | partial | 0 | | | Most calibrated; zero false positives |
| Haiku | native | T1 | 30s | fail | 7 | | | 7 blockers |
| Haiku | native | T2 | 38s | fail | 10 | | | Applied prefix memory rule correctly |
| Haiku | native | T3 | 29s | partial | 4 | | | Self-rated confidence 2-3/5 (honest) |
| Sonnet | native | T1 | 134s | fail | ~12 | | | Methodical; precise rule citations |
| Sonnet | native | T2 | 163s | fail | ~14 | | | Most methodical rule walk |
| Sonnet | native | T3 | 329s | fail | 11 | | | Self-corrected 2 false positives mid-review |
| Seed | 8k | T3 | 121s | fail | 13 | 5 | 4 | err single-letter FPs dominate (12/13B) |
| Seed | 16k | T3 | 166s | fail | 10 | 11 | 9 | Diverse: else statements, panic, naming |

Results Matrix — Coherence

| Provider | Budget | Tier | Time | Verdict | B | W | N | Key Findings |
|---|---|---|---|---|---|---|---|---|
| Kimi | 8k | T1 | 48s | pass | 0 | | | Pass |
| Kimi | 8k | T2 | 39s | pass | 0 | | | Pass |
| Kimi | 8k | T3 | 60s | fail | 3 | | | 3 blockers |
| Kimi | 12k | T1 | 130s | pass | 0 | | | Valley of doubt; over-rationalization |
| Kimi | 12k | T2 | 40s | partial | 0 | | | Partial, 0 blockers |
| Kimi | 12k | T3 | 40s | fail | 9 | | | Strongest single Kimi result; 9 blockers |
| Kimi | 16k | T1 | 56s | partial | 0 | | | Partial, 0 blockers |
| Kimi | 16k | T2 | 74s | pass | 0 | | | Pass |
| Kimi | 16k | T3 | 64s | partial | 4 | | | 4 blockers |
| Kimi | 48k | T3 | 48s | pass | 0 | | | Complete rubber stamp; 0 blockers |
| Qwen | 8k | T1 | 75s | fail | 8 | | | 8 blockers |
| Qwen | 8k | T2 | 115s | fail | 2 | | | 2 blockers |
| Qwen | 8k | T3 | 172s | fail | 5 | | | hostname -I whitespace parsing |
| Qwen | 12k | T1 | 78s | fail | 2 | | | 2 blockers |
| Qwen | 12k | T2 | 135s | partial | 1 | | | Partial, 1 blocker |
| Qwen | 12k | T3 | 153s | fail | 11 | | | Symlink rejection pattern; deserializer panic |
| Qwen | 16k | T1 | 80s | fail | 3 | | | 3 blockers |
| Qwen | 16k | T2 | 67s | partial | 4 | | | Partial, 4 blockers |
| Qwen | 16k | T3 | 110s | partial | 4 | | | Partial, 4 blockers |
| Qwen | 32k | T1 | 74s | fail | 4 | | | 4 blockers |
| Qwen | 32k | T2 | 116s | partial | 2 | | | Partial, 2 blockers |
| Qwen | 32k | T3 | 171s | partial | 0 | | | Self-corrected to 0 blockers |
| Haiku | native | T1 | 70s | fail | 4 | | | 4 blockers |
| Haiku | native | T2 | 123s | fail | 3 | | | CharsetPresenceGuarantor len=2 bug |
| Haiku | native | T3 | 74s | fail | 10 | | | stderr buffer loss when file-redirected |
| Sonnet | native | T1 | 144s | fail | 3 | | | CNAME/NS trailing dot (novel) |
| Sonnet | native | T2 | 112s | fail | 3 | | | CharsetPresenceGuarantor len=2 proven |
| Sonnet | native | T3 | 224s | fail | 3 | | | Brotli --rm inconsistency; symlink asymmetry |
| Seed | 8k | T3 | 54s | fail | 6 | 5 | 4 | Well-distributed: 6 distinct module issues |
| Seed | 16k | T3 | 147s | fail | 2 | 13 | 7 | Severity downgrade; 4B→W, 8 new warnings |