<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Record Break — modded-nanogpt 57.38s on 8×B200</title>
<style>
@import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=JetBrains+Mono:wght@400;500&display=swap');
:root {
--bg: #0d1117;
--bg-card: #161b22;
--bg-code: #1c2129;
--border: #30363d;
--text: #e6edf3;
--text-muted: #8b949e;
--text-dim: #6e7681;
--accent: #58a6ff;
--accent-glow: rgba(88, 166, 255, 0.15);
--green: #3fb950;
--green-dim: rgba(63, 185, 80, 0.15);
--red: #f85149;
--red-dim: rgba(248, 81, 73, 0.15);
--yellow: #d29922;
--purple: #bc8cff;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
background: var(--bg);
color: var(--text);
line-height: 1.7;
font-size: 16px;
}
.hero {
text-align: center;
padding: 80px 20px 60px;
background: linear-gradient(180deg, #0d1117 0%, #111920 50%, #0d1117 100%);
border-bottom: 1px solid var(--border);
position: relative;
overflow: hidden;
}
.hero::before {
content: '';
position: absolute;
top: -50%;
left: 50%;
transform: translateX(-50%);
width: 800px;
height: 800px;
background: radial-gradient(circle, rgba(88,166,255,0.06) 0%, transparent 70%);
pointer-events: none;
}
.hero-badge {
display: inline-block;
background: var(--accent-glow);
border: 1px solid rgba(88,166,255,0.3);
color: var(--accent);
padding: 6px 16px;
border-radius: 20px;
font-size: 13px;
font-weight: 600;
letter-spacing: 0.5px;
text-transform: uppercase;
margin-bottom: 24px;
}
.hero h1 {
font-size: 3.2em;
font-weight: 800;
letter-spacing: -1.5px;
margin-bottom: 8px;
background: linear-gradient(135deg, #fff 0%, #58a6ff 100%);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
}
.hero-time {
font-size: 5em;
font-weight: 800;
font-family: 'JetBrains Mono', monospace;
color: var(--green);
margin: 16px 0;
letter-spacing: -2px;
text-shadow: 0 0 60px rgba(63,185,80,0.3);
}
.hero-sub {
font-size: 1.2em;
color: var(--text-muted);
max-width: 600px;
margin: 0 auto;
}
.hero-meta {
display: flex;
justify-content: center;
gap: 32px;
margin-top: 32px;
flex-wrap: wrap;
}
.hero-meta-item {
text-align: center;
}
.hero-meta-item .label {
font-size: 12px;
text-transform: uppercase;
letter-spacing: 1px;
color: var(--text-dim);
margin-bottom: 4px;
}
.hero-meta-item .value {
font-size: 18px;
font-weight: 600;
font-family: 'JetBrains Mono', monospace;
}
.hero-meta-item .value.green { color: var(--green); }
.hero-meta-item .value.accent { color: var(--accent); }
.hero-meta-item .value.yellow { color: var(--yellow); }
.container {
max-width: 900px;
margin: 0 auto;
padding: 0 24px;
}
section {
padding: 48px 0;
border-bottom: 1px solid var(--border);
}
section:last-of-type { border-bottom: none; }
h2 {
font-size: 1.8em;
font-weight: 700;
margin-bottom: 24px;
letter-spacing: -0.5px;
display: flex;
align-items: center;
gap: 12px;
}
h2 .icon {
font-size: 0.8em;
}
h3 {
font-size: 1.25em;
font-weight: 600;
margin: 32px 0 16px;
color: var(--accent);
}
p {
margin-bottom: 16px;
color: var(--text-muted);
}
p strong { color: var(--text); }
a {
color: var(--accent);
text-decoration: none;
}
a:hover { text-decoration: underline; }
/* Links table */
.links-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
gap: 12px;
margin: 24px 0;
}
.link-card {
background: var(--bg-card);
border: 1px solid var(--border);
border-radius: 8px;
padding: 16px;
transition: border-color 0.2s;
}
.link-card:hover { border-color: var(--accent); }
.link-card .link-label { font-size: 12px; color: var(--text-dim); text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 6px; }
.link-card a { font-weight: 500; font-size: 14px; }
/* Tables */
.table-wrap {
overflow-x: auto;
margin: 16px 0 24px;
border-radius: 8px;
border: 1px solid var(--border);
}
table {
width: 100%;
border-collapse: collapse;
font-size: 14px;
}
thead {
background: var(--bg-card);
}
th {
padding: 12px 16px;
text-align: left;
font-weight: 600;
font-size: 12px;
text-transform: uppercase;
letter-spacing: 0.5px;
color: var(--text-muted);
border-bottom: 1px solid var(--border);
white-space: nowrap;
}
td {
padding: 10px 16px;
border-bottom: 1px solid rgba(48,54,61,0.5);
white-space: nowrap;
}
tbody tr:nth-child(even) { background: rgba(22,27,34,0.5); }
tbody tr:nth-child(odd) { background: rgba(13,17,23,0.5); }
tbody tr:hover { background: rgba(88,166,255,0.05); }
tr.record-row {
background: var(--green-dim) !important;
}
tr.miss-row {
background: var(--red-dim) !important;
}
.check { color: var(--green); font-weight: 700; }
.cross { color: var(--red); font-weight: 700; }
.mono { font-family: 'JetBrains Mono', monospace; font-size: 13px; }
/* Code blocks */
pre {
background: var(--bg-code);
border: 1px solid var(--border);
border-radius: 8px;
padding: 20px;
overflow-x: auto;
margin: 16px 0 24px;
font-family: 'JetBrains Mono', monospace;
font-size: 13px;
line-height: 1.6;
}
code {
font-family: 'JetBrains Mono', monospace;
font-size: 0.9em;
}
p code, li code {
background: var(--bg-code);
border: 1px solid var(--border);
border-radius: 4px;
padding: 2px 6px;
font-size: 13px;
}
.diff-add { color: var(--green); }
.diff-remove { color: var(--red); }
.diff-comment { color: var(--text-dim); }
/* Stat cards */
.stat-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(180px, 1fr));
gap: 16px;
margin: 24px 0;
}
.stat-card {
background: var(--bg-card);
border: 1px solid var(--border);
border-radius: 10px;
padding: 20px;
text-align: center;
}
.stat-card .stat-value {
font-size: 2em;
font-weight: 800;
font-family: 'JetBrains Mono', monospace;
color: var(--green);
}
.stat-card .stat-label {
font-size: 13px;
color: var(--text-dim);
margin-top: 4px;
text-transform: uppercase;
letter-spacing: 0.5px;
}
/* Stage comparison */
.stage-compare {
display: grid;
grid-template-columns: 1fr auto 1fr;
gap: 16px;
align-items: center;
margin: 24px 0;
}
.stage-box {
background: var(--bg-card);
border: 1px solid var(--border);
border-radius: 10px;
padding: 20px;
text-align: center;
}
.stage-box.old { border-color: var(--red); }
.stage-box.new { border-color: var(--green); }
.stage-box .stage-title { font-size: 13px; color: var(--text-dim); text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px; }
.stage-box .stage-values { font-family: 'JetBrains Mono', monospace; font-size: 18px; font-weight: 600; }
.arrow {
font-size: 32px;
color: var(--accent);
}
/* Info box */
.info-box {
background: var(--accent-glow);
border: 1px solid rgba(88,166,255,0.3);
border-radius: 8px;
padding: 16px 20px;
margin: 16px 0 24px;
}
.info-box p { color: var(--text); margin: 0; }
/* Bullet list */
ul, ol {
margin: 12px 0 16px 24px;
color: var(--text-muted);
}
li { margin-bottom: 6px; }
li strong { color: var(--text); }
/* Footer */
.footer {
text-align: center;
padding: 48px 20px;
color: var(--text-dim);
font-size: 14px;
}
.footer .brand {
font-size: 16px;
font-weight: 700;
color: var(--text-muted);
margin-bottom: 8px;
}
/* Responsive */
@media (max-width: 640px) {
.hero h1 { font-size: 2em; }
.hero-time { font-size: 3em; }
.hero-meta { gap: 20px; }
.stage-compare { grid-template-columns: 1fr; }
.arrow { transform: rotate(90deg); text-align: center; }
table { font-size: 12px; }
th, td { padding: 8px 10px; }
}
</style>
</head>
<body>
<!-- Hero -->
<div class="hero">
<div class="hero-badge">Record #78 β€” modded-nanogpt Speedrun</div>
<h1>New Record Break</h1>
<div class="hero-time">57.38s</div>
<p class="hero-sub">
Training GPT-2 (124M) to val_loss ≤ 3.28 on 8×B200 GPUs<br>
<strong>−0.66s (−1.1%)</strong> vs previous record, with a 3-line code change
</p>
<div class="hero-meta">
<div class="hero-meta-item">
<div class="label">Val Loss</div>
<div class="value green">3.2798</div>
</div>
<div class="hero-meta-item">
<div class="label">Previous</div>
<div class="value yellow">58.04s</div>
</div>
<div class="hero-meta-item">
<div class="label">Improvement</div>
<div class="value green">βˆ’1.1%</div>
</div>
<div class="hero-meta-item">
<div class="label">Date</div>
<div class="value accent">2026-03-15</div>
</div>
</div>
</div>
<div class="container">
<!-- Links -->
<section>
<h2><span class="icon">πŸ”—</span> Resources</h2>
<div class="links-grid">
<div class="link-card">
<div class="link-label">Git Commit</div>
<a href="https://github.com/voltropy/modded-nanogpt/commit/116a70e4ef87125608374b2b257d7d5376554529"><code>116a70e</code></a>
</div>
<div class="link-card">
<div class="link-label">Branch</div>
<a href="https://github.com/voltropy/modded-nanogpt/tree/voltropy/record-78-stage-shift">voltropy/record-78-stage-shift</a>
</div>
<div class="link-card">
<div class="link-label">GCS Artifacts</div>
<code style="font-size:12px;color:var(--text-muted)">gs://volta-artifacts/benchmarks/<wbr>modded-nanogpt/record-break-20260315/</code>
</div>
<div class="link-card">
<div class="link-label">Based On</div>
<a href="https://github.com/KellerJordan/modded-nanogpt/commit/81730c3057a02df2b3c30b255aef42424716a2c5">Record #77 (KellerJordan)</a>
</div>
</div>
</section>
<!-- Optimization -->
<section>
<h2><span class="icon">⚑</span> The Optimization: Stage Duration Shift</h2>
<h3>Background</h3>
<p>The modded-nanogpt speedrun trains a 124M-parameter GPT-2 model in <strong>3 stages</strong> with increasing batch sizes and sequence lengths. Total steps: <strong>1490</strong> (1450 scheduled + 40 extension).</p>
<div class="table-wrap">
<table>
<thead>
<tr><th>Stage</th><th>Seq Length</th><th>Batch Size</th><th>Step Time</th><th>Cost Ratio</th></tr>
</thead>
<tbody>
<tr><td><strong>1</strong></td><td class="mono">896</td><td class="mono">8 × 2048 × 8</td><td class="mono">~21ms</td><td style="color:var(--green)">1.0× (cheap)</td></tr>
<tr><td><strong>2</strong></td><td class="mono">2048</td><td class="mono">16 × 2048 × 8</td><td class="mono">~38ms</td><td style="color:var(--yellow)">1.8× (medium)</td></tr>
<tr><td><strong>3</strong></td><td class="mono">2048</td><td class="mono">24 × 2048 × 8</td><td class="mono">~55ms</td><td style="color:var(--red)">2.6× (expensive)</td></tr>
</tbody>
</table>
</div>
<p>Stage 1 steps are <strong>2.6× faster</strong> than Stage 3 steps due to shorter sequences (896 vs 2048) and smaller batches (8 vs 24).</p>
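<p>A few lines of Python reproduce the cost-ratio column as a sanity check. This is purely illustrative: the per-step times are the approximate figures from the table above, not measured constants.</p>
<pre># Illustrative sketch; stage parameters and ~step times taken from the table above.
stages = {
    1: {"seq_len": 896,  "step_ms": 21},   # cheap
    2: {"seq_len": 2048, "step_ms": 38},   # medium
    3: {"seq_len": 2048, "step_ms": 55},   # expensive
}
base = stages[1]["step_ms"]
for s, cfg in stages.items():
    print(f"Stage {s}: {cfg['step_ms'] / base:.1f}x relative step cost")
# Stage 1: 1.0x, Stage 2: 1.8x, Stage 3: 2.6x</pre>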
<h3>The Change</h3>
<p>Shift training duration from equal thirds to front-loaded: <strong>more cheap Stage 1 steps, fewer expensive Stage 2/3 steps.</strong></p>
<div class="stage-compare">
<div class="stage-box old">
<div class="stage-title">Record #77 (Baseline)</div>
<div class="stage-values">33% / 33% / 33%</div>
<div style="color:var(--text-dim);font-size:13px;margin-top:4px">497 / 497 / 497 steps</div>
</div>
<div class="arrow">β†’</div>
<div class="stage-box new">
<div class="stage-title">Record #78 (Ours)</div>
<div class="stage-values">36% / 31% / 33%</div>
<div style="color:var(--text-dim);font-size:13px;margin-top:4px">522 / 450 / 479 steps</div>
</div>
</div>
<p>In code — <strong>3 lines changed</strong>:</p>
<pre><span class="diff-comment"># Before (Record #77)</span>
<span class="diff-remove">TrainingStage(duration=1/3, ...) # Stage 1: 497 steps @ ~21ms</span>
<span class="diff-remove">TrainingStage(duration=1/3, ...) # Stage 2: 497 steps @ ~38ms</span>
<span class="diff-remove">TrainingStage(duration=1/3, ...) # Stage 3: 497 steps @ ~55ms</span>
<span class="diff-comment"># After (Record #78 β€” Variant T)</span>
<span class="diff-add">TrainingStage(duration=0.36, ...) # Stage 1: 522 steps @ ~21ms (+25 cheap steps)</span>
<span class="diff-add">TrainingStage(duration=0.31, ...) # Stage 2: 450 steps @ ~38ms (βˆ’47 medium steps)</span>
<span class="diff-add">TrainingStage(duration=0.33, ...) # Stage 3: 479 steps @ ~55ms (βˆ’18 expensive steps)</span></pre>
<div class="info-box">
<p><strong>No other changes.</strong> Same architecture, optimizer, hyperparameters, total step count, cooldown, MTP weights, triton kernels, data loading.</p>
</div>
<h3>Why It Works</h3>
<p>Early training (Stage 1) primarily learns short-range statistics — bigram frequencies, common phrases, local syntax patterns. These don't require full 2048-token sequences. By shifting roughly 3 percentage points of the step budget into Stage 1 (where each step costs ~21ms rather than ~38–55ms), we save <strong>~0.7s of wall-clock time</strong> while the model still converges to the same validation loss.</p>
<p>The insight: <strong>equal stage durations were never optimal — they were just the default.</strong> The batch size schedule was already tuned, but the duration schedule was assumed to be uniform.</p>
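<p>For concreteness, here is a minimal sketch of the stage-schedule mechanism the diff above modifies. The real <code>TrainingStage</code> lives in <code>train_gpt_volta.py</code>; the field names and the rounding of fractional durations into step counts below are assumptions for illustration, not the script's exact code.</p>
<pre>from dataclasses import dataclass

@dataclass
class TrainingStage:        # hypothetical mirror of the script's stage config
    duration: float         # fraction of the scheduled step budget
    seq_len: int
    batch_size: int

# Variant T; per-stage seq lengths and batch sizes from the Background table.
STAGES = [TrainingStage(0.36, 896, 8),
          TrainingStage(0.31, 2048, 16),
          TrainingStage(0.33, 2048, 24)]

def stage_for_step(step, stages=STAGES, total_steps=1450):
    """Map a global step index to its stage (illustrative rounding)."""
    boundary = 0
    for stage in stages:
        boundary += round(stage.duration * total_steps)
        if step &lt; boundary:
            return stage
    return stages[-1]       # extension steps run at the final stage's config</pre>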
</section>
<!-- Results -->
<section>
<h2><span class="icon">πŸ“Š</span> Results</h2>
<h3>Confirmation Runs β€” Variant T (36/31/33)</h3>
<div class="table-wrap">
<table>
<thead>
<tr><th>Machine</th><th>Time</th><th>val_loss</th><th>Step Avg</th><th>Hit Target?</th></tr>
</thead>
<tbody>
<tr class="record-row"><td><strong>volta-b200-3</strong></td><td class="mono"><strong>57.382s</strong></td><td class="mono">3.2798</td><td class="mono">38.51ms</td><td><span class="check">βœ… BEST β€” NEW RECORD</span></td></tr>
<tr class="miss-row"><td>volta-b200-0</td><td class="mono">57.433s</td><td class="mono">3.2826</td><td class="mono">38.55ms</td><td><span class="cross">❌</span> <span style="color:var(--text-dim)">(0.003 over)</span></td></tr>
<tr><td>volta-b200-2</td><td class="mono">57.476s</td><td class="mono">3.2794</td><td class="mono">38.57ms</td><td><span class="check">βœ…</span></td></tr>
</tbody>
</table>
</div>
<h3>Confirmation Runs — Variant S (35/32/33)</h3>
<div class="table-wrap">
<table>
<thead>
<tr><th>Machine</th><th>Time</th><th>val_loss</th><th>Step Avg</th><th>Hit Target?</th></tr>
</thead>
<tbody>
<tr><td>volta-b200-0</td><td class="mono">57.666s</td><td class="mono">3.2780</td><td class="mono">38.70ms</td><td><span class="check">✅</span></td></tr>
<tr class="miss-row"><td>volta-b200-1</td><td class="mono">57.681s</td><td class="mono">3.2810</td><td class="mono">38.71ms</td><td><span class="cross">❌</span> <span style="color:var(--text-dim)">(0.001 over)</span></td></tr>
<tr><td>volta-b200-2</td><td class="mono">57.794s</td><td class="mono">3.2799</td><td class="mono">38.79ms</td><td><span class="check">✅</span></td></tr>
</tbody>
</table>
</div>
<p><strong>4 out of 6 confirmation runs hit val_loss ≤ 3.28</strong> across both variants. Variant T is faster (57.38–57.48s) than Variant S (57.67–57.79s).</p>
<h3>Early Exit Analysis (Extended Runs)</h3>
<p>To characterize convergence reliability, we ran 12 extended runs with <code>max_steps=1640</code> and early exit at <code>val_loss ≤ 3.28</code>:</p>
<div class="stat-grid">
<div class="stat-card">
<div class="stat-value">12/12</div>
<div class="stat-label">Runs Hit Target</div>
</div>
<div class="stat-card">
<div class="stat-value">1600</div>
<div class="stat-label">Exit Step (All)</div>
</div>
<div class="stat-card">
<div class="stat-value">60.93s</div>
<div class="stat-label">Mean Time</div>
</div>
<div class="stat-card">
<div class="stat-value">3.2746</div>
<div class="stat-label">Mean val_loss</div>
</div>
</div>
<div class="info-box">
<p><strong>Interpretation:</strong> At the standard 1490 steps, the optimization sits right on the convergence boundary (~50% hit rate for val_loss ≤ 3.28). By step 1600, every extended run had converged. The 57.38s record represents a lucky-but-legitimate run at the edge of the convergence envelope.</p>
</div>
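<p>For reference, the early-exit mechanism is just a periodic check against the target. Below is a minimal sketch of such a loop; the stubs, the 100-step eval cadence, and the toy loss curve are assumptions for illustration, not the training script's actual internals.</p>
<pre># Illustrative early-exit loop: train up to max_steps, stop once the target is hit.
MAX_STEPS, TARGET, EVAL_EVERY = 1640, 3.28, 100

def train_one_step(step):        # stub: the real step does fwd/bwd + optimizer update
    pass

def evaluate(step):              # stub: toy loss curve that crosses 3.28 near step 1600
    return 3.30 - step * 1.3e-5

for step in range(1, MAX_STEPS + 1):
    train_one_step(step)
    if step % EVAL_EVERY == 0:
        val_loss = evaluate(step)
        if val_loss &lt;= TARGET:
            print(f"early exit at step {step}, val_loss={val_loss:.4f}")
            break</pre>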
</section>
<!-- Binary Search -->
<section>
<h2><span class="icon">πŸ”</span> Binary Search Progression</h2>
<p>The optimal stage durations were found through <strong>systematic binary search over 24 experiments across 6 rounds</strong> on the volta-b200 fleet. The search varied Stage 1 and Stage 2 percentages while keeping total steps fixed at 1490.</p>
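<p>Each round can be thought of as perturbing S1/S2 around a center point while pinning S3 so the fractions sum to 1. The helper below is a hedged sketch of that candidate generation; the actual experiment grid was chosen by hand across the four machines.</p>
<pre># Illustrative: candidate (S1, S2, S3) splits around a center point, sum fixed at 1.
def round_candidates(s1, s2, delta=0.01):
    for ds1, ds2 in [(+delta, -delta), (-delta, +delta), (+delta, 0), (0, +delta)]:
        c1, c2 = s1 + ds1, s2 + ds2
        yield (round(c1, 2), round(c2, 2), round(1 - c1 - c2, 2))

print(list(round_candidates(0.36, 0.32)))  # neighbors of a 36/32/32 center</pre>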
<h3>Round 1: Wide Exploration</h3>
<div class="table-wrap">
<table>
<thead><tr><th>#</th><th>S1/S2/S3</th><th>Machine</th><th>Time</th><th>val_loss</th><th>Notes</th></tr></thead>
<tbody>
<tr><td>1</td><td class="mono">33/33/33</td><td>b200-0</td><td class="mono">58.04s</td><td class="mono">3.278</td><td>Record #77 baseline</td></tr>
<tr><td>2</td><td class="mono">50/25/25</td><td>b200-1</td><td class="mono">53.8s</td><td class="mono">3.295+</td><td style="color:var(--red)">Too aggressive, loss doesn't converge</td></tr>
<tr><td>3</td><td class="mono">40/30/30</td><td>b200-2</td><td class="mono">55.9s</td><td class="mono">3.290+</td><td style="color:var(--red)">Too aggressive</td></tr>
<tr><td>4</td><td class="mono">25/25/50</td><td>b200-3</td><td class="mono">61.2s</td><td class="mono">3.272</td><td style="color:var(--text-dim)">Back-loaded — slower but converges easily</td></tr>
</tbody>
</table>
</div>
<h3>Round 2: Narrowing Stage 3</h3>
<div class="table-wrap">
<table>
<thead><tr><th>#</th><th>S1/S2/S3</th><th>Machine</th><th>Time</th><th>val_loss</th><th>Notes</th></tr></thead>
<tbody>
<tr><td>5</td><td class="mono">37/33/30</td><td>b200-0</td><td class="mono">56.3s</td><td class="mono">3.285</td><td>Getting close</td></tr>
<tr><td>6</td><td class="mono">35/35/30</td><td>b200-1</td><td class="mono">56.5s</td><td class="mono">3.286</td><td>Similar</td></tr>
<tr><td>7</td><td class="mono">38/32/30</td><td>b200-2</td><td class="mono">56.1s</td><td class="mono">3.287</td><td>Stage 3 too short</td></tr>
<tr><td>8</td><td class="mono">36/34/30</td><td>b200-3</td><td class="mono">56.4s</td><td class="mono">3.285</td><td>Need more Stage 3</td></tr>
</tbody>
</table>
</div>
<h3>Round 3: Stage 3 at 32–34%</h3>
<div class="table-wrap">
<table>
<thead><tr><th>#</th><th>S1/S2/S3</th><th>Machine</th><th>Time</th><th>val_loss</th><th>Notes</th></tr></thead>
<tbody>
<tr><td>9</td><td class="mono">36/32/32</td><td>b200-0</td><td class="mono">57.0s</td><td class="mono">3.282</td><td>Very close</td></tr>
<tr><td>10</td><td class="mono">35/33/32</td><td>b200-1</td><td class="mono">57.1s</td><td class="mono">3.282</td><td>Similar</td></tr>
<tr><td>11</td><td class="mono">37/31/32</td><td>b200-2</td><td class="mono">56.8s</td><td class="mono">3.283</td><td>Stage 3 still slightly short</td></tr>
<tr><td>12</td><td class="mono">34/34/32</td><td>b200-3</td><td class="mono">57.2s</td><td class="mono">3.281</td><td>Almost there</td></tr>
</tbody>
</table>
</div>
<h3>Round 4: Stage 3 at 33%</h3>
<div class="table-wrap">
<table>
<thead><tr><th>#</th><th>S1/S2/S3</th><th>Machine</th><th>Time</th><th>val_loss</th><th>Notes</th></tr></thead>
<tbody>
<tr><td>13</td><td class="mono">36/31/33</td><td>b200-0</td><td class="mono">57.4s</td><td class="mono">3.280</td><td><span class="check">✅ Hits target!</span></td></tr>
<tr><td>14</td><td class="mono">35/32/33</td><td>b200-1</td><td class="mono">57.7s</td><td class="mono">3.279</td><td><span class="check">✅ Also hits</span></td></tr>
<tr><td>15</td><td class="mono">37/30/33</td><td>b200-2</td><td class="mono">57.2s</td><td class="mono">3.282</td><td>Close miss</td></tr>
<tr><td>16</td><td class="mono">34/33/33</td><td>b200-3</td><td class="mono">57.8s</td><td class="mono">3.279</td><td><span class="check">✅ Hits but slower</span></td></tr>
</tbody>
</table>
</div>
<h3>Round 5: Fine-tuning Best Candidates</h3>
<div class="table-wrap">
<table>
<thead><tr><th>#</th><th>S1/S2/S3</th><th>Machine</th><th>Time</th><th>val_loss</th><th>Notes</th></tr></thead>
<tbody>
<tr><td>17</td><td class="mono">36/31/33</td><td>b200-0</td><td class="mono">57.43s</td><td class="mono">3.283</td><td>Narrow miss</td></tr>
<tr><td>18</td><td class="mono">36/31/33</td><td>b200-2</td><td class="mono">57.48s</td><td class="mono">3.279</td><td><span class="check">✅</span></td></tr>
<tr><td>19</td><td class="mono">35/32/33</td><td>b200-0</td><td class="mono">57.67s</td><td class="mono">3.278</td><td><span class="check">✅</span></td></tr>
<tr class="miss-row"><td>20</td><td class="mono">35/32/33</td><td>b200-1</td><td class="mono">57.68s</td><td class="mono">3.281</td><td><span class="cross">❌</span></td></tr>
</tbody>
</table>
</div>
<h3>Round 6: Final Confirmation</h3>
<div class="table-wrap">
<table>
<thead><tr><th>#</th><th>S1/S2/S3</th><th>Machine</th><th>Time</th><th>val_loss</th><th>Notes</th></tr></thead>
<tbody>
<tr class="record-row"><td>21</td><td class="mono">36/31/33</td><td>b200-3</td><td class="mono"><strong>57.382s</strong></td><td class="mono">3.2798</td><td><span class="check">βœ… NEW RECORD</span></td></tr>
<tr><td>22</td><td class="mono">36/31/33</td><td>b200-2</td><td class="mono">57.476s</td><td class="mono">3.2794</td><td><span class="check">βœ…</span></td></tr>
<tr><td>23</td><td class="mono">35/32/33</td><td>b200-2</td><td class="mono">57.794s</td><td class="mono">3.2799</td><td><span class="check">βœ…</span></td></tr>
<tr class="miss-row"><td>24</td><td class="mono">36/31/33</td><td>b200-0</td><td class="mono">57.433s</td><td class="mono">3.2826</td><td><span class="cross">❌</span> <span style="color:var(--text-dim)">(0.003 over)</span></td></tr>
</tbody>
</table>
</div>
<p><strong>Conclusion:</strong> Variant T (36/31/33) is the fastest (~57.4s mean time); Variant S (35/32/33) converges slightly more reliably but runs slower (~57.7s mean time). Variant T was selected for the record.</p>
</section>
<!-- Reproduction -->
<section>
<h2><span class="icon">πŸ”„</span> Reproduction</h2>
<h3>Prerequisites</h3>
<ul>
<li><strong>8× NVIDIA B200 GPUs</strong> (or 8×H100 — see hardware note below)</li>
<li>Docker with NVIDIA runtime</li>
<li>FineWeb10B dataset at <code>/mnt/data/modded-nanogpt/data/fineweb10B/</code></li>
<li>HuggingFace token for dataset download</li>
</ul>
<h3>Docker Image</h3>
<div class="table-wrap">
<table>
<thead><tr><th>Component</th><th>Version</th></tr></thead>
<tbody>
<tr><td>Image</td><td class="mono">modded-nanogpt-b200:fa4-fixed</td></tr>
<tr><td>Image ID</td><td class="mono">e235f3cf1d55</td></tr>
<tr><td>PyTorch</td><td class="mono">2.9.1+cu128</td></tr>
<tr><td>flash-attn-4</td><td class="mono">4.0.0b4</td></tr>
<tr><td>triton</td><td class="mono">3.5.1</td></tr>
<tr><td>CUDA</td><td class="mono">12.8</td></tr>
<tr><td>Driver</td><td class="mono">570.211.01</td></tr>
<tr><td>Python</td><td class="mono">3.12</td></tr>
</tbody>
</table>
</div>
<h3>Run Command</h3>
<pre>docker run --gpus all --ipc=host --net=host \
-v /mnt/data/modded-nanogpt:/workspace \
-w /workspace \
-e HF_TOKEN=$HF_TOKEN \
modded-nanogpt-b200:fa4-fixed \
torchrun --nproc_per_node=8 train_gpt_volta.py</pre>
<p>No CLI flags needed — all configuration is embedded in the training script.</p>
<h3>File Checksums (MD5)</h3>
<pre>0e05688a3549f36c055fd61c4a3682ab train_gpt_volta.py <span class="diff-comment">(Variant T: 36/31/33)</span>
7fc8edea3ea953ab13af0dc3d86ecc55 triton_kernels.py
330b0c49e0180fd26b8909deb7807ff8 fa4_compile_wrapper.py</pre>
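<p>To verify local copies against these checksums, a short Python check works (it assumes the three files sit in the current directory):</p>
<pre>import hashlib

# Expected MD5 checksums, copied from the list above.
EXPECTED = {
    "train_gpt_volta.py":     "0e05688a3549f36c055fd61c4a3682ab",
    "triton_kernels.py":      "7fc8edea3ea953ab13af0dc3d86ecc55",
    "fa4_compile_wrapper.py": "330b0c49e0180fd26b8909deb7807ff8",
}
for name, want in EXPECTED.items():
    got = hashlib.md5(open(name, "rb").read()).hexdigest()
    print(name, "OK" if got == want else f"MISMATCH: {got}")</pre>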
<h3>GCS Artifacts</h3>
<pre>gs://volta-artifacts/benchmarks/modded-nanogpt/record-break-20260315/
├── PAGEDROP.md
├── record_break/
│   ├── record77_varT.py <span class="diff-comment"># Winning variant T training script</span>
│   ├── record77_varS.py <span class="diff-comment"># Variant S training script</span>
│   ├── triton_kernels.py <span class="diff-comment"># Triton kernel implementations</span>
│   ├── fa4_compile_wrapper.py <span class="diff-comment"># Flash Attention 4 (B200-specific)</span>
│   └── RESULTS.md <span class="diff-comment"># Raw results</span></pre>
</section>
<!-- Hardware Note -->
<section>
<h2><span class="icon">πŸ–₯️</span> Hardware Agnosticism</h2>
<p>The <strong>stage duration shift is hardware-agnostic</strong> β€” it works on any GPU that can run the baseline modded-nanogpt speedrun. The optimization is purely about redistributing training steps across stages, which is independent of GPU architecture.</p>
<p>The only B200-specific component is <code>fa4_compile_wrapper.py</code> (Flash Attention 4). For other GPUs:</p>
<ul>
<li><strong>H100:</strong> Replace the FA4 import with the existing FA3 <code>get_kernel('varunneal/flash-attention-3')</code> call from Record #77's <code>train_gpt.py</code>, as sketched below</li>
<li><strong>A100/4090:</strong> Use the standard attention implementation from the upstream repo</li>
</ul>
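<p>A hedged sketch of that swap follows. The <code>get_kernel</code> call is the one cited above from Record #77; the import names on both branches are assumptions, so check the upstream <code>train_gpt.py</code> and <code>fa4_compile_wrapper.py</code> for the exact call sites.</p>
<pre># Sketch: select the attention kernel per GPU generation.
USE_FA4 = True   # True on B200; set False on H100

if USE_FA4:
    from fa4_compile_wrapper import flash_attn_func       # B200-only FA4 wrapper (name assumed)
else:
    from kernels import get_kernel                        # HF kernels hub, as in Record #77
    flash_attn_func = get_kernel('varunneal/flash-attention-3').flash_attn_func
    # NOTE: the 'flash_attn_func' attribute name is assumed, not verified here.</pre>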
</section>
<!-- What Changed -->
<section>
<h2><span class="icon">πŸ“‹</span> What Changed vs What Did NOT Change</h2>
<h3>Changed</h3>
<ol>
<li>Stage durations: <code>1/3, 1/3, 1/3</code> → <code>0.36, 0.31, 0.33</code> (the 3-line change responsible for the speedup)</li>
<li>Flash Attention import: FA3 → FA4 (B200-specific, not part of the duration optimization)</li>
</ol>
<h3>NOT Changed</h3>
<ul>
<li>Model architecture (parallel 2-lane residual, skip connections, paired head attention, hyperconnect)</li>
<li>Optimizer (NorMuon+Adam, Muon LR=0.023, Adam LR=0.008)</li>
<li>Total step count (1490)</li>
<li>Cooldown fraction (0.60)</li>
<li>Batch sizes per stage (8, 16, 24)</li>
<li>Sequence lengths per stage (896, 2048, 2048)</li>
<li>Learning rate multipliers (1.0, 1.52, 1.73)</li>
<li>MTP weights</li>
<li>Triton kernels</li>
<li>Data loading pipeline</li>
<li>Anything else</li>
</ul>
</section>
</div>
<!-- Footer -->
<div class="footer">
<div class="brand">⚑ Voltropy PBC</div>
<p>Record set by Kurtz on 2026-03-15</p>
<p>Optimization discovered through systematic binary search of the stage duration parameter space<br>across 24 experiments on 4× volta-b200 machines (8×B200 each)</p>
</div>
<script src="https://pagedrop.ai/g/jalehman/3c031225cb70b73fe080f60f1b174cce"></script>
</body>
</html>