Skip to content

Instantly share code, notes, and snippets.

@am17an
Last active May 21, 2026 07:39
Show Gist options
  • Select an option

  • Save am17an/228edfb84ed082aa88e3865d6fa27090 to your computer and use it in GitHub Desktop.

Select an option

Save am17an/228edfb84ed082aa88e3865d6fa27090 to your computer and use it in GitHub Desktop.
MTP benchmark
#!/usr/bin/env python3
import argparse, json, sys, time
from urllib import request
PROMPTS = [
{"name": "code_python", "prompt": "Write a Python function that returns the n-th Fibonacci number using memoization. Include a docstring."},
{"name": "code_cpp", "prompt": "Write a C++ template function `clamp(x, lo, hi)` that returns x clamped to [lo, hi]. No std::clamp."},
{"name": "explain_concept", "prompt": "Explain how speculative decoding works in large language model inference, in three short paragraphs."},
{"name": "summarize", "prompt": "Summarize in two sentences: The Industrial Revolution began in Britain in the late 18th century, transforming manufacturing through mechanization, steam power, and the factory system. It spread to continental Europe and North America during the 19th century."},
{"name": "qa_factual", "prompt": "Q: What are the four fundamental forces of physics?\nA:"},
{"name": "translation", "prompt": "Translate to French: 'The quick brown fox jumps over the lazy dog.'"},
{"name": "creative_short", "prompt": "Write a four-line poem about an old lighthouse."},
{"name": "stepwise_math", "prompt": "Solve step by step: A train leaves station A at 60 km/h. Two hours later, a second train leaves the same station on the same track at 90 km/h. How long until the second train catches the first?"},
{"name": "long_code_review", "prompt": (
"You are reviewing a backend service that has been suffering intermittent latency spikes "
"in production. Below is the relevant code and a description of the system. After reading "
"carefully, produce a structured review with three sections: (1) likely root causes ranked "
"by probability, (2) concrete code or configuration changes you would make first, "
"(3) what telemetry you would add to confirm the diagnosis.\n\n"
"System description: a Python FastAPI service in front of a Postgres 15 database, deployed "
"as four replicas behind an nginx load balancer. Each request reads a user record, fetches "
"their last 50 events from a partitioned events table, computes an aggregate score, writes "
"the score back to the user row, and returns a JSON response. Average payload is 4 KB. "
"p50 latency is 35 ms; p99 spikes to 1.8 seconds approximately every 90 seconds in a "
"regular pattern. The spikes correlate with elevated Postgres CPU but not with elevated "
"Postgres connection count. The application pool is sized at 20 connections per replica. "
"PgBouncer is in front of Postgres in transaction pooling mode with a pool size of 50.\n\n"
"Code excerpt — the hot endpoint:\n"
"```python\n@app.post('/score/{user_id}')\nasync def score(user_id: int, payload: ScoreRequest):\n"
" async with db.transaction() as tx:\n user = await tx.fetchrow(\n"
" 'SELECT id, tier, last_score FROM users WHERE id = $1 FOR UPDATE',\n user_id,\n )\n"
" if user is None:\n raise HTTPException(404)\n events = await tx.fetch(\n"
" 'SELECT type, weight, ts FROM events '\n 'WHERE user_id = $1 ORDER BY ts DESC LIMIT 50',\n user_id,\n )\n"
" new_score = compute_score(user['tier'], events, payload.signals)\n"
" await tx.execute(\n 'UPDATE users SET last_score = $1, updated_at = now() WHERE id = $2',\n new_score, user_id,\n )\n"
" await tx.execute(\n 'INSERT INTO score_history (user_id, score, ts) VALUES ($1, $2, now())',\n user_id, new_score,\n )\n"
" await cache.set(f'score:{user_id}', new_score, ex=300)\n"
" metrics.histogram('score.latency_ms').observe((time.time() - start) * 1000)\n"
" return {'user_id': user_id, 'score': new_score}\n```\n\n"
"Schema notes: `users` is ~50M rows, `events` is partitioned by month with ~2B rows total "
"and a btree index on `(user_id, ts DESC)`. `score_history` is unpartitioned, ~800M rows, "
"with a single index on `user_id`. Postgres autovacuum is at default settings. There is "
"a nightly batch job that rebuilds materialized views starting at 02:00 UTC; spikes occur "
"throughout the day, not just during the batch window. Connection pooling metrics show "
"PgBouncer waiting connections occasionally hit 8-12 during spikes but never saturate. "
"CPU on the FastAPI replicas stays below 30% even during spikes. Network round-trip time "
"between the application and Postgres is consistently 0.4 ms.\n\nBegin your review now."
)},
]
def post(url, payload):
req = request.Request(url, data=json.dumps(payload).encode(), headers={"Content-Type":"application/json"}, method="POST")
with request.urlopen(req, timeout=300) as r:
return json.loads(r.read())
def run(args):
out = {"results": []}
for p in PROMPTS:
t0 = time.time()
r = post(f"{args.url}/v1/chat/completions", {
"model": "llama",
"messages": [{"role": "user", "content": p["prompt"]}],
"max_tokens": 192,
"seed": 42,
})
wall = time.time() - t0
# OpenAI-compatible endpoint: timings are in usage or top-level
usage = r.get("usage", {}) or {}
t = r.get("timings", {}) or {}
predicted_n = usage.get("completion_tokens") or t.get("predicted_n")
predicted_per_second = t.get("predicted_per_second") or (predicted_n / wall if wall > 0 else 0)
rec = {"name": p["name"], "wall_s": round(wall,3),
"predicted_n": predicted_n, "predicted_per_second": round(predicted_per_second, 2),
"draft_n": t.get("draft_n",0), "draft_n_accepted": t.get("draft_n_accepted",0)}
rec["accept_rate"] = round(rec["draft_n_accepted"]/rec["draft_n"],4) if rec["draft_n"] else None
out["results"].append(rec)
ar = f"{rec['accept_rate']:.3f}" if rec["accept_rate"] is not None else "n/a"
print(f" {rec['name']:<18} pred={rec['predicted_n']:>4} draft={rec['draft_n']:>4} acc={rec['draft_n_accepted']:>4} rate={ar} tok/s={rec['predicted_per_second']:.1f}")
td = sum(x["draft_n"] or 0 for x in out["results"])
ta = sum(x["draft_n_accepted"] or 0 for x in out["results"])
tp = sum(x["predicted_n"] or 0 for x in out["results"])
tw = sum(x["wall_s"] for x in out["results"])
out["aggregate"] = {"n_requests": len(out["results"]), "total_predicted": tp, "total_draft": td, "total_draft_accepted": ta,
"aggregate_accept_rate": round(ta/td,4) if td else None, "wall_s_total": round(tw,2)}
print("\nAggregate:", json.dumps(out["aggregate"], indent=2))
if args.out:
json.dump(out, open(args.out,"w"), indent=2); print("Wrote", args.out)
def diff(a, b):
A, B = json.load(open(a)), json.load(open(b))
print(f"{'metric':<24} {'A':>14} {'B':>14} {'delta':>10}")
for k in ("aggregate_accept_rate","total_predicted","total_draft","total_draft_accepted","wall_s_total"):
va, vb = A["aggregate"].get(k), B["aggregate"].get(k)
if va is None or vb is None: print(f"{k:<24} {str(va):>14} {str(vb):>14}"); continue
d = vb - va
s = f"{d:>+10.4f}" if isinstance(d,float) else f"{d:>+10}"
print(f"{k:<24} {va:>14} {vb:>14} {s}")
by_a = {x["name"]: x for x in A["results"]}
print("\n{:<20} {:>8} {:>8} {:>8}".format("prompt","A","B","delta"))
for rb in B["results"]:
ra = by_a.get(rb["name"]) or {}
ar = ra.get("accept_rate") or 0; br = rb.get("accept_rate") or 0
print(f"{rb['name']:<20} {ar:>8.3f} {br:>8.3f} {br-ar:>+8.3f}")
ap = argparse.ArgumentParser()
ap.add_argument("--url", default="http://127.0.0.1:8080")
ap.add_argument("--out")
ap.add_argument("--diff", nargs=2)
a = ap.parse_args()
if a.diff: diff(*a.diff)
else: run(a)

No MTP

./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  summarize          pred=  53 draft=   0 acc=   0 rate=n/a tok/s=7.1
  qa_factual         pred= 177 draft=   0 acc=   0 rate=n/a tok/s=7.0
  translation        pred=  22 draft=   0 acc=   0 rate=n/a tok/s=7.7
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.1
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.2
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1404,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 201.07
}

MTP --spec-draft-max-n 3

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3

  code_python        pred= 192 draft= 153 acc= 139 rate=0.908 tok/s=21.6
  code_cpp           pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=18.7
  explain_concept    pred= 192 draft= 191 acc= 126 rate=0.660 tok/s=16.3
  summarize          pred=  55 draft=  51 acc=  37 rate=0.726 tok/s=17.9
  qa_factual         pred= 177 draft= 174 acc= 118 rate=0.678 tok/s=16.5
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=13.9
  creative_short     pred= 192 draft= 200 acc= 123 rate=0.615 tok/s=15.8
  stepwise_math      pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=19.3
  long_code_review   pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=18.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1319,
  "total_draft_accepted": 952,
  "aggregate_accept_rate": 0.7218,
  "wall_s_total": 83.8
}

MTP --spec-draft-max-n 2

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2

  code_python        pred= 192 draft= 134 acc= 123 rate=0.918 tok/s=17.4
  code_cpp           pred= 192 draft= 145 acc= 118 rate=0.814 tok/s=16.5
  explain_concept    pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=16.1
  summarize          pred=  55 draft=  44 acc=  32 rate=0.727 tok/s=15.6
  qa_factual         pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=18.2
  translation        pred=  22 draft=  18 acc=  12 rate=0.667 tok/s=15.2
  creative_short     pred= 192 draft= 149 acc= 116 rate=0.778 tok/s=16.1
  stepwise_math      pred= 192 draft= 139 acc= 121 rate=0.871 tok/s=17.2
  long_code_review   pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=15.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1421,
  "total_draft": 1062,
  "total_draft_accepted": 877,
  "aggregate_accept_rate": 0.8258,
  "wall_s_total": 90.44
}

Draft model (Qwen3.5 0.8B) with spec-draft-n-max 16 with partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 16 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 188 acc= 156 rate=0.830 tok/s=26.4
  code_cpp           pred= 192 draft= 201 acc= 126 rate=0.627 tok/s=16.8
  explain_concept    pred= 192 draft= 263 acc= 112 rate=0.426 tok/s=12.7
  summarize          pred=  57 draft=  63 acc=  39 rate=0.619 tok/s=16.9
  qa_factual         pred= 192 draft= 178 acc= 177 rate=0.994 tok/s=47.7
  translation        pred=  23 draft=  18 acc=  15 rate=0.833 tok/s=18.7
  creative_short     pred= 192 draft= 189 acc= 120 rate=0.635 tok/s=15.4
  stepwise_math      pred= 192 draft= 190 acc= 148 rate=0.779 tok/s=22.3
  long_code_review   pred= 192 draft= 207 acc= 120 rate=0.580 tok/s=14.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1424,
  "total_draft": 1497,
  "total_draft_accepted": 1013,
  "aggregate_accept_rate": 0.6767,
  "wall_s_total": 81.39
}

Master with draft model with spec-draft-n-max 64 with no partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 64 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 174 acc= 159 rate=0.914 tok/s=27.2
  code_cpp           pred= 192 draft= 138 acc= 120 rate=0.870 tok/s=15.0
  explain_concept    pred= 192 draft= 170 acc= 101 rate=0.594 tok/s=11.4
  summarize          pred=  55 draft=  48 acc=  36 rate=0.750 tok/s=14.6
  qa_factual         pred= 177 draft= 126 acc= 106 rate=0.841 tok/s=13.9
  translation        pred=  22 draft=  13 acc=  13 rate=1.000 tok/s=16.5
  creative_short     pred= 192 draft= 136 acc= 104 rate=0.765 tok/s=12.8
  stepwise_math      pred= 192 draft= 172 acc= 147 rate=0.855 tok/s=22.0
  long_code_review   pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=13.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1137,
  "total_draft_accepted": 897,
  "aggregate_accept_rate": 0.7889,
  "wall_s_total": 97.13
}
@yurkes111
Copy link
Copy Markdown

привет

@eabase
Copy link
Copy Markdown

eabase commented May 19, 2026

Nice! TY.

I was wondering when someone would start standardizing the prompts for TPS measures.
tok/s=<random> is totally meaningless without specifying the exact prompt used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment