Interpreting SWE-bench scores
# LinkedIn Post
# https://www.linkedin.com/posts/cadrlife_let-me-translate-the-o3-coding-benchmark-activity-7277038051590029313-yJlR?utm_source=share&utm_medium=member_desktop
# Slides using the minimalist tool Sent
# https://tools.suckless.org/sent/
SWE-bench
SWE-bench measures AI coding agents
on realistic tasks from GitHub
Provides an incomplete
yet useful view
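# Each task pairs a real GitHub issue with a repository snapshot and hidden tests.
# A minimal sketch of inspecting one task, assuming the Verified split is published
# on Hugging Face under the ID below and exposes these field names (both are
# assumptions worth double-checking):
from datasets import load_dataset

tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = tasks[0]
print(task["repo"])               # an OSS Python project, e.g. something like django/django
print(task["problem_statement"])  # the GitHub issue text the agent must resolve
print(task["FAIL_TO_PASS"])       # hidden tests that must pass after the agent's patch is applied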
SOTA
SOTA = State Of The Art
50% to 55%
Verified Subset
Leaders | |
- Propietary: Amazon Q Developer | |
- Open Source: OpenHands CodeAct | |
- Updates on SWE-bench.com | |
SOTA Model
Claude 3.5 Sonnet
o1 is strong but slow
o3 will be insanely expensive
71.7% is reported
Misconceptions
1) Benchmarks gonna benchmark
Benchmarks do not compare
humans to machines
Human exams are not
calibrated for machines
One benchmark is never
the end of the line
Benchmarks become saturated
and we find new ones
2) Sampling
Score ≠ % of all tasks
Examples are from
OSS Python libraries
Selects for simple
well-defined tasks
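# To make the denominator concrete, a hypothetical worked example
# (the resolved count of 250 is made up; only the subset size of 500 is real):
verified_total = 500   # Verified is a fixed, human-screened sample of 500 tasks
resolved = 250         # hypothetical: tasks where the agent's patch made the hidden tests pass
score = 100 * resolved / verified_total
print(f"{score:.0f}% of 500 curated OSS Python tasks resolved")
# Not 50% of all software work; the sample skews toward small, well-specified fixes.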
3) Subsets
Subsets
- full
- lite
- verified
Currently, we mostly use
SWE-bench Verified
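# The subsets differ mainly in size and vetting. A small sketch to compare them
# locally, assuming these Hugging Face dataset IDs (an assumption; adjust if they move):
from datasets import load_dataset

subsets = {
    "full":     "princeton-nlp/SWE-bench",           # roughly 2,300 task instances
    "lite":     "princeton-nlp/SWE-bench_Lite",      # 300 tasks, cheaper to run
    "verified": "princeton-nlp/SWE-bench_Verified",  # 500 human-validated tasks
}
for name, dataset_id in subsets.items():
    split = load_dataset(dataset_id, split="test")
    print(f"{name}: {len(split)} task instances")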
4) Agents vs Models
SWE-bench is mostly
an agent benchmark
Most improvements came
from better agents
Model improvements have
made only a minor difference
Claude 3.5 Sonnet is the best
These are all basically GPT-4
5) Contamination
Konwinski Prize