Created March 15, 2026 18:49
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Record Break: modded-nanogpt 57.38s on 8×B200</title>
<style>
@import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=JetBrains+Mono:wght@400;500&display=swap');
:root {
  --bg: #0d1117;
  --bg-card: #161b22;
  --bg-code: #1c2129;
  --border: #30363d;
  --text: #e6edf3;
  --text-muted: #8b949e;
  --text-dim: #6e7681;
  --accent: #58a6ff;
  --accent-glow: rgba(88, 166, 255, 0.15);
  --green: #3fb950;
  --green-dim: rgba(63, 185, 80, 0.15);
  --red: #f85149;
  --red-dim: rgba(248, 81, 73, 0.15);
  --yellow: #d29922;
  --purple: #bc8cff;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: var(--bg);
  color: var(--text);
  line-height: 1.7;
  font-size: 16px;
}
.hero {
  text-align: center;
  padding: 80px 20px 60px;
  background: linear-gradient(180deg, #0d1117 0%, #111920 50%, #0d1117 100%);
  border-bottom: 1px solid var(--border);
  position: relative;
  overflow: hidden;
}
.hero::before {
  content: '';
  position: absolute;
  top: -50%;
  left: 50%;
  transform: translateX(-50%);
  width: 800px;
  height: 800px;
  background: radial-gradient(circle, rgba(88,166,255,0.06) 0%, transparent 70%);
  pointer-events: none;
}
.hero-badge {
  display: inline-block;
  background: var(--accent-glow);
  border: 1px solid rgba(88,166,255,0.3);
  color: var(--accent);
  padding: 6px 16px;
  border-radius: 20px;
  font-size: 13px;
  font-weight: 600;
  letter-spacing: 0.5px;
  text-transform: uppercase;
  margin-bottom: 24px;
}
.hero h1 {
  font-size: 3.2em;
  font-weight: 800;
  letter-spacing: -1.5px;
  margin-bottom: 8px;
  background: linear-gradient(135deg, #fff 0%, #58a6ff 100%);
  -webkit-background-clip: text;
  -webkit-text-fill-color: transparent;
  background-clip: text;
}
.hero-time {
  font-size: 5em;
  font-weight: 800;
  font-family: 'JetBrains Mono', monospace;
  color: var(--green);
  margin: 16px 0;
  letter-spacing: -2px;
  text-shadow: 0 0 60px rgba(63,185,80,0.3);
}
.hero-sub {
  font-size: 1.2em;
  color: var(--text-muted);
  max-width: 600px;
  margin: 0 auto;
}
.hero-meta {
  display: flex;
  justify-content: center;
  gap: 32px;
  margin-top: 32px;
  flex-wrap: wrap;
}
.hero-meta-item {
  text-align: center;
}
.hero-meta-item .label {
  font-size: 12px;
  text-transform: uppercase;
  letter-spacing: 1px;
  color: var(--text-dim);
  margin-bottom: 4px;
}
.hero-meta-item .value {
  font-size: 18px;
  font-weight: 600;
  font-family: 'JetBrains Mono', monospace;
}
.hero-meta-item .value.green { color: var(--green); }
.hero-meta-item .value.accent { color: var(--accent); }
.hero-meta-item .value.yellow { color: var(--yellow); }
.container {
  max-width: 900px;
  margin: 0 auto;
  padding: 0 24px;
}
section {
  padding: 48px 0;
  border-bottom: 1px solid var(--border);
}
section:last-of-type { border-bottom: none; }
h2 {
  font-size: 1.8em;
  font-weight: 700;
  margin-bottom: 24px;
  letter-spacing: -0.5px;
  display: flex;
  align-items: center;
  gap: 12px;
}
h2 .icon {
  font-size: 0.8em;
}
h3 {
  font-size: 1.25em;
  font-weight: 600;
  margin: 32px 0 16px;
  color: var(--accent);
}
p {
  margin-bottom: 16px;
  color: var(--text-muted);
}
p strong { color: var(--text); }
a {
  color: var(--accent);
  text-decoration: none;
}
a:hover { text-decoration: underline; }
/* Links table */
.links-grid {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
  gap: 12px;
  margin: 24px 0;
}
.link-card {
  background: var(--bg-card);
  border: 1px solid var(--border);
  border-radius: 8px;
  padding: 16px;
  transition: border-color 0.2s;
}
.link-card:hover { border-color: var(--accent); }
.link-card .link-label { font-size: 12px; color: var(--text-dim); text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 6px; }
.link-card a { font-weight: 500; font-size: 14px; }
/* Tables */
.table-wrap {
  overflow-x: auto;
  margin: 16px 0 24px;
  border-radius: 8px;
  border: 1px solid var(--border);
}
table {
  width: 100%;
  border-collapse: collapse;
  font-size: 14px;
}
thead {
  background: var(--bg-card);
}
th {
  padding: 12px 16px;
  text-align: left;
  font-weight: 600;
  font-size: 12px;
  text-transform: uppercase;
  letter-spacing: 0.5px;
  color: var(--text-muted);
  border-bottom: 1px solid var(--border);
  white-space: nowrap;
}
td {
  padding: 10px 16px;
  border-bottom: 1px solid rgba(48,54,61,0.5);
  white-space: nowrap;
}
tbody tr:nth-child(even) { background: rgba(22,27,34,0.5); }
tbody tr:nth-child(odd) { background: rgba(13,17,23,0.5); }
tbody tr:hover { background: rgba(88,166,255,0.05); }
tr.record-row {
  background: var(--green-dim) !important;
}
tr.miss-row {
  background: var(--red-dim) !important;
}
.check { color: var(--green); font-weight: 700; }
.cross { color: var(--red); font-weight: 700; }
.mono { font-family: 'JetBrains Mono', monospace; font-size: 13px; }
/* Code blocks */
pre {
  background: var(--bg-code);
  border: 1px solid var(--border);
  border-radius: 8px;
  padding: 20px;
  overflow-x: auto;
  margin: 16px 0 24px;
  font-family: 'JetBrains Mono', monospace;
  font-size: 13px;
  line-height: 1.6;
}
code {
  font-family: 'JetBrains Mono', monospace;
  font-size: 0.9em;
}
p code, li code {
  background: var(--bg-code);
  border: 1px solid var(--border);
  border-radius: 4px;
  padding: 2px 6px;
  font-size: 13px;
}
.diff-add { color: var(--green); }
.diff-remove { color: var(--red); }
.diff-comment { color: var(--text-dim); }
/* Stat cards */
.stat-grid {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(180px, 1fr));
  gap: 16px;
  margin: 24px 0;
}
.stat-card {
  background: var(--bg-card);
  border: 1px solid var(--border);
  border-radius: 10px;
  padding: 20px;
  text-align: center;
}
.stat-card .stat-value {
  font-size: 2em;
  font-weight: 800;
  font-family: 'JetBrains Mono', monospace;
  color: var(--green);
}
.stat-card .stat-label {
  font-size: 13px;
  color: var(--text-dim);
  margin-top: 4px;
  text-transform: uppercase;
  letter-spacing: 0.5px;
}
/* Stage comparison */
.stage-compare {
  display: grid;
  grid-template-columns: 1fr auto 1fr;
  gap: 16px;
  align-items: center;
  margin: 24px 0;
}
.stage-box {
  background: var(--bg-card);
  border: 1px solid var(--border);
  border-radius: 10px;
  padding: 20px;
  text-align: center;
}
.stage-box.old { border-color: var(--red); }
.stage-box.new { border-color: var(--green); }
.stage-box .stage-title { font-size: 13px; color: var(--text-dim); text-transform: uppercase; letter-spacing: 0.5px; margin-bottom: 8px; }
.stage-box .stage-values { font-family: 'JetBrains Mono', monospace; font-size: 18px; font-weight: 600; }
.arrow {
  font-size: 32px;
  color: var(--accent);
}
/* Info box */
.info-box {
  background: var(--accent-glow);
  border: 1px solid rgba(88,166,255,0.3);
  border-radius: 8px;
  padding: 16px 20px;
  margin: 16px 0 24px;
}
.info-box p { color: var(--text); margin: 0; }
/* Bullet list */
ul, ol {
  margin: 12px 0 16px 24px;
  color: var(--text-muted);
}
li { margin-bottom: 6px; }
li strong { color: var(--text); }
/* Footer */
.footer {
  text-align: center;
  padding: 48px 20px;
  color: var(--text-dim);
  font-size: 14px;
}
.footer .brand {
  font-size: 16px;
  font-weight: 700;
  color: var(--text-muted);
  margin-bottom: 8px;
}
/* Responsive */
@media (max-width: 640px) {
  .hero h1 { font-size: 2em; }
  .hero-time { font-size: 3em; }
  .hero-meta { gap: 20px; }
  .stage-compare { grid-template-columns: 1fr; }
  .arrow { transform: rotate(90deg); text-align: center; }
  table { font-size: 12px; }
  th, td { padding: 8px 10px; }
}
</style>
</head>
| </head> | |
| <body> | |
| <!-- Hero --> | |
| <div class="hero"> | |
| <div class="hero-badge">Record #78 β modded-nanogpt Speedrun</div> | |
| <h1>New Record Break</h1> | |
| <div class="hero-time">57.38s</div> | |
| <p class="hero-sub"> | |
| Training GPT-2 (124M) to val_loss β€ 3.28 on 8ΓB200 GPUs<br> | |
| <strong>β0.66s (β1.1%)</strong> vs previous record, with a 3-line code change | |
| </p> | |
| <div class="hero-meta"> | |
| <div class="hero-meta-item"> | |
| <div class="label">Val Loss</div> | |
| <div class="value green">3.2798</div> | |
| </div> | |
| <div class="hero-meta-item"> | |
| <div class="label">Previous</div> | |
| <div class="value yellow">58.04s</div> | |
| </div> | |
| <div class="hero-meta-item"> | |
| <div class="label">Improvement</div> | |
| <div class="value green">β1.1%</div> | |
| </div> | |
| <div class="hero-meta-item"> | |
| <div class="label">Date</div> | |
| <div class="value accent">2026-03-15</div> | |
| </div> | |
| </div> | |
| </div> | |
<div class="container">
<!-- Links -->
<section>
  <h2><span class="icon">🔗</span> Resources</h2>
  <div class="links-grid">
    <div class="link-card">
      <div class="link-label">Git Commit</div>
      <a href="https://github.com/voltropy/modded-nanogpt/commit/116a70e4ef87125608374b2b257d7d5376554529"><code>116a70e</code></a>
    </div>
    <div class="link-card">
      <div class="link-label">Branch</div>
      <a href="https://github.com/voltropy/modded-nanogpt/tree/voltropy/record-78-stage-shift">voltropy/record-78-stage-shift</a>
    </div>
    <div class="link-card">
      <div class="link-label">GCS Artifacts</div>
      <code style="font-size:12px;color:var(--text-muted)">gs://volta-artifacts/benchmarks/<wbr>modded-nanogpt/record-break-20260315/</code>
    </div>
    <div class="link-card">
      <div class="link-label">Based On</div>
      <a href="https://github.com/KellerJordan/modded-nanogpt/commit/81730c3057a02df2b3c30b255aef42424716a2c5">Record #77 (KellerJordan)</a>
    </div>
  </div>
</section>
<!-- Optimization -->
<section>
  <h2><span class="icon">⚡</span> The Optimization: Stage Duration Shift</h2>
  <h3>Background</h3>
  <p>The modded-nanogpt speedrun trains a 124M-parameter GPT-2 model in <strong>3 stages</strong> with increasing batch sizes and sequence lengths. Total steps: <strong>1490</strong> (1450 scheduled + 40 extension).</p>
  <div class="table-wrap">
    <table>
      <thead>
        <tr><th>Stage</th><th>Seq Length</th><th>Batch Size</th><th>Step Time</th><th>Cost Ratio</th></tr>
      </thead>
      <tbody>
        <tr><td><strong>1</strong></td><td class="mono">896</td><td class="mono">8 × 2048 × 8</td><td class="mono">~21ms</td><td style="color:var(--green)">1.0× (cheap)</td></tr>
        <tr><td><strong>2</strong></td><td class="mono">2048</td><td class="mono">16 × 2048 × 8</td><td class="mono">~38ms</td><td style="color:var(--yellow)">1.8× (medium)</td></tr>
        <tr><td><strong>3</strong></td><td class="mono">2048</td><td class="mono">24 × 2048 × 8</td><td class="mono">~55ms</td><td style="color:var(--red)">2.6× (expensive)</td></tr>
      </tbody>
    </table>
  </div>
  <p>Stage 1 steps are <strong>2.6× faster</strong> than Stage 3 steps due to shorter sequences (896 vs 2048) and smaller batches (8 vs 24).</p>
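  <p>The trade can be sanity-checked with a toy cost model built from the approximate per-step times above (a sketch: <code>schedule_cost_s</code> is illustrative, not a function from the training script, and the millisecond figures are the rounded values from the table):</p>

```python
# Toy cost model for the 3-stage schedule, using the approximate
# per-step times from the table above (ms). Rounded figures, not constants.
STEP_MS = (21, 38, 55)  # stage 1, stage 2, stage 3

def schedule_cost_s(steps_per_stage):
    """Estimated wall-clock seconds for the given per-stage step counts."""
    return sum(n * ms for n, ms in zip(steps_per_stage, STEP_MS)) / 1000

baseline  = schedule_cost_s((497, 497, 497))  # Record #77: equal thirds
variant_t = schedule_cost_s((522, 450, 479))  # Record #78: front-loaded

# The crude model predicts a larger gap (about 2.3s) than the measured
# 0.66s, since real step times drift within a stage, but the sign agrees:
assert baseline > variant_t
```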
  <h3>The Change</h3>
  <p>Shift training duration from equal thirds to front-loaded: <strong>more cheap Stage 1 steps, fewer expensive Stage 2/3 steps.</strong></p>
  <div class="stage-compare">
    <div class="stage-box old">
      <div class="stage-title">Record #77 (Baseline)</div>
      <div class="stage-values">33% / 33% / 33%</div>
      <div style="color:var(--text-dim);font-size:13px;margin-top:4px">497 / 497 / 497 steps</div>
    </div>
    <div class="arrow">→</div>
    <div class="stage-box new">
      <div class="stage-title">Record #78 (Ours)</div>
      <div class="stage-values">36% / 31% / 33%</div>
      <div style="color:var(--text-dim);font-size:13px;margin-top:4px">522 / 450 / 479 steps</div>
    </div>
  </div>
  <p>In code – <strong>3 lines changed</strong>:</p>
  <pre><span class="diff-comment"># Before (Record #77)</span>
<span class="diff-remove">TrainingStage(duration=1/3, ...)   # Stage 1: 497 steps @ ~21ms</span>
<span class="diff-remove">TrainingStage(duration=1/3, ...)   # Stage 2: 497 steps @ ~38ms</span>
<span class="diff-remove">TrainingStage(duration=1/3, ...)   # Stage 3: 497 steps @ ~55ms</span>
<span class="diff-comment"># After (Record #78 – Variant T)</span>
<span class="diff-add">TrainingStage(duration=0.36, ...)  # Stage 1: 522 steps @ ~21ms (+25 cheap steps)</span>
<span class="diff-add">TrainingStage(duration=0.31, ...)  # Stage 2: 450 steps @ ~38ms (−47 medium steps)</span>
<span class="diff-add">TrainingStage(duration=0.33, ...)  # Stage 3: 479 steps @ ~55ms (−18 expensive steps)</span></pre>
  <div class="info-box">
    <p><strong>No other changes.</strong> Same architecture, optimizer, hyperparameters, total step count, cooldown, MTP weights, triton kernels, data loading.</p>
  </div>
  <h3>Why It Works</h3>
  <p>Early training (Stage 1) primarily learns short-range statistics: bigram frequencies, common phrases, local syntax patterns. These don't require full 2048-token sequences. By shifting 3% of the step budget into Stage 1 (where each step costs ~21ms instead of ~55ms), we save <strong>~0.7s of wall-clock time</strong> while the model still converges to the same validation loss.</p>
  <p>The insight: <strong>equal stage durations were never optimal – they were just the default.</strong> The batch size schedule was already tuned, but the duration schedule was assumed to be symmetric.</p>
</section>
<!-- Results -->
<section>
  <h2><span class="icon">📊</span> Results</h2>
  <h3>Confirmation Runs – Variant T (36/31/33)</h3>
  <div class="table-wrap">
    <table>
      <thead>
        <tr><th>Machine</th><th>Time</th><th>val_loss</th><th>Step Avg</th><th>Hit Target?</th></tr>
      </thead>
      <tbody>
        <tr class="record-row"><td><strong>volta-b200-3</strong></td><td class="mono"><strong>57.382s</strong></td><td class="mono">3.2798</td><td class="mono">38.51ms</td><td><span class="check">✓ BEST – NEW RECORD</span></td></tr>
        <tr class="miss-row"><td>volta-b200-0</td><td class="mono">57.433s</td><td class="mono">3.2826</td><td class="mono">38.55ms</td><td><span class="cross">✗</span> <span style="color:var(--text-dim)">(0.003 over)</span></td></tr>
        <tr><td>volta-b200-2</td><td class="mono">57.476s</td><td class="mono">3.2794</td><td class="mono">38.57ms</td><td><span class="check">✓</span></td></tr>
      </tbody>
    </table>
  </div>
  <h3>Confirmation Runs – Variant S (35/32/33)</h3>
  <div class="table-wrap">
    <table>
      <thead>
        <tr><th>Machine</th><th>Time</th><th>val_loss</th><th>Step Avg</th><th>Hit Target?</th></tr>
      </thead>
      <tbody>
        <tr><td>volta-b200-0</td><td class="mono">57.666s</td><td class="mono">3.2780</td><td class="mono">38.70ms</td><td><span class="check">✓</span></td></tr>
        <tr class="miss-row"><td>volta-b200-1</td><td class="mono">57.681s</td><td class="mono">3.2810</td><td class="mono">38.71ms</td><td><span class="cross">✗</span> <span style="color:var(--text-dim)">(0.001 over)</span></td></tr>
        <tr><td>volta-b200-2</td><td class="mono">57.794s</td><td class="mono">3.2799</td><td class="mono">38.79ms</td><td><span class="check">✓</span></td></tr>
      </tbody>
    </table>
  </div>
  <p><strong>4 out of 6 confirmation runs hit val_loss ≤ 3.28</strong> across both variants. Variant T is faster (57.38–57.48s) than Variant S (57.67–57.79s).</p>
  <h3>Early Exit Analysis (Extended Runs)</h3>
  <p>To characterize convergence reliability, we ran 12 extended runs with <code>max_steps=1640</code> and early exit at <code>val_loss ≤ 3.28</code>:</p>
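  <p>The early-exit rule can be sketched as a thin wrapper around the validation loop (hypothetical shape: <code>evaluate</code> and the validation interval are placeholders, not the script's real hook names):</p>

```python
# Sketch of the early-exit rule used for the extended runs: train for up
# to max_steps, but stop at the first validation check whose loss reaches
# the target. `evaluate(step)` stands in for the real validation hook,
# and val_every is a made-up checking interval.
def train_with_early_exit(evaluate, max_steps=1640, target=3.28, val_every=20):
    """Return the step at which val_loss first reached target, else max_steps."""
    for step in range(val_every, max_steps + 1, val_every):
        if target >= evaluate(step):
            return step
    return max_steps
```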
  <div class="stat-grid">
    <div class="stat-card">
      <div class="stat-value">12/12</div>
      <div class="stat-label">Runs Hit Target</div>
    </div>
    <div class="stat-card">
      <div class="stat-value">1600</div>
      <div class="stat-label">Exit Step (All)</div>
    </div>
    <div class="stat-card">
      <div class="stat-value">60.93s</div>
      <div class="stat-label">Mean Time</div>
    </div>
    <div class="stat-card">
      <div class="stat-value">3.2746</div>
      <div class="stat-label">Mean val_loss</div>
    </div>
  </div>
  <div class="info-box">
    <p><strong>Interpretation:</strong> At the standard 1490 steps, the optimization sits right on the convergence boundary (~50% hit rate for val_loss ≤ 3.28). By step 1600, all 12 extended runs had converged. The 57.38s record represents a lucky-but-legitimate run at the edge of the convergence envelope.</p>
  </div>
</section>
<!-- Binary Search -->
<section>
  <h2><span class="icon">🔍</span> Binary Search Progression</h2>
  <p>The optimal stage durations were found through <strong>systematic binary search over 24 experiments across 6 rounds</strong> on the volta-b200 fleet. The search varied the Stage 1 and Stage 2 percentages while keeping total steps fixed at 1490.</p>
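  <p>One narrowing round of that search can be sketched as follows (<code>run_experiment</code> is a stand-in for dispatching a full training run to a machine; the real search was driven across the fleet, not by this helper):</p>

```python
# Sketch of one narrowing round: try candidate (s1, s2, s3) splits and
# keep the fastest one that still converges to the target val_loss.
def narrow_round(candidates, run_experiment, target=3.28):
    """candidates: (s1, s2, s3) fraction triples summing to ~1.0.
    run_experiment(split) returns (wall_time_s, val_loss).
    Returns the fastest converging split, or None if none converged."""
    best = None
    for split in candidates:
        wall_time, val_loss = run_experiment(split)
        if target >= val_loss:  # converged
            if best is None or best[0] > wall_time:
                best = (wall_time, split)
    return None if best is None else best[1]
```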
  <h3>Round 1: Wide Exploration</h3>
  <div class="table-wrap">
    <table>
      <thead><tr><th>#</th><th>S1/S2/S3</th><th>Machine</th><th>Time</th><th>val_loss</th><th>Notes</th></tr></thead>
      <tbody>
        <tr><td>1</td><td class="mono">33/33/33</td><td>b200-0</td><td class="mono">58.04s</td><td class="mono">3.278</td><td>Record #77 baseline</td></tr>
        <tr><td>2</td><td class="mono">50/25/25</td><td>b200-1</td><td class="mono">53.8s</td><td class="mono">3.295+</td><td style="color:var(--red)">Too aggressive, loss doesn't converge</td></tr>
        <tr><td>3</td><td class="mono">40/30/30</td><td>b200-2</td><td class="mono">55.9s</td><td class="mono">3.290+</td><td style="color:var(--red)">Too aggressive</td></tr>
        <tr><td>4</td><td class="mono">25/25/50</td><td>b200-3</td><td class="mono">61.2s</td><td class="mono">3.272</td><td style="color:var(--text-dim)">Back-loaded – slower but converges easily</td></tr>
      </tbody>
    </table>
  </div>
  <h3>Round 2: Narrowing Stage 3</h3>
  <div class="table-wrap">
    <table>
      <thead><tr><th>#</th><th>S1/S2/S3</th><th>Machine</th><th>Time</th><th>val_loss</th><th>Notes</th></tr></thead>
      <tbody>
        <tr><td>5</td><td class="mono">37/33/30</td><td>b200-0</td><td class="mono">56.3s</td><td class="mono">3.285</td><td>Getting close</td></tr>
        <tr><td>6</td><td class="mono">35/35/30</td><td>b200-1</td><td class="mono">56.5s</td><td class="mono">3.286</td><td>Similar</td></tr>
        <tr><td>7</td><td class="mono">38/32/30</td><td>b200-2</td><td class="mono">56.1s</td><td class="mono">3.287</td><td>Stage 3 too short</td></tr>
        <tr><td>8</td><td class="mono">36/34/30</td><td>b200-3</td><td class="mono">56.4s</td><td class="mono">3.285</td><td>Need more Stage 3</td></tr>
      </tbody>
    </table>
  </div>
  <h3>Round 3: Stage 3 at 32%</h3>
  <div class="table-wrap">
    <table>
      <thead><tr><th>#</th><th>S1/S2/S3</th><th>Machine</th><th>Time</th><th>val_loss</th><th>Notes</th></tr></thead>
      <tbody>
        <tr><td>9</td><td class="mono">36/32/32</td><td>b200-0</td><td class="mono">57.0s</td><td class="mono">3.282</td><td>Very close</td></tr>
        <tr><td>10</td><td class="mono">35/33/32</td><td>b200-1</td><td class="mono">57.1s</td><td class="mono">3.282</td><td>Similar</td></tr>
        <tr><td>11</td><td class="mono">37/31/32</td><td>b200-2</td><td class="mono">56.8s</td><td class="mono">3.283</td><td>Stage 3 still slightly short</td></tr>
        <tr><td>12</td><td class="mono">34/34/32</td><td>b200-3</td><td class="mono">57.2s</td><td class="mono">3.281</td><td>Almost there</td></tr>
      </tbody>
    </table>
  </div>
  <h3>Round 4: Stage 3 at 33%</h3>
  <div class="table-wrap">
    <table>
      <thead><tr><th>#</th><th>S1/S2/S3</th><th>Machine</th><th>Time</th><th>val_loss</th><th>Notes</th></tr></thead>
      <tbody>
        <tr><td>13</td><td class="mono">36/31/33</td><td>b200-0</td><td class="mono">57.4s</td><td class="mono">3.280</td><td><span class="check">✓ Hits target!</span></td></tr>
        <tr><td>14</td><td class="mono">35/32/33</td><td>b200-1</td><td class="mono">57.7s</td><td class="mono">3.279</td><td><span class="check">✓ Also hits</span></td></tr>
        <tr><td>15</td><td class="mono">37/30/33</td><td>b200-2</td><td class="mono">57.2s</td><td class="mono">3.282</td><td>Close miss</td></tr>
        <tr><td>16</td><td class="mono">34/33/33</td><td>b200-3</td><td class="mono">57.8s</td><td class="mono">3.279</td><td><span class="check">✓ Hits but slower</span></td></tr>
      </tbody>
    </table>
  </div>
  <h3>Round 5: Fine-tuning Best Candidates</h3>
  <div class="table-wrap">
    <table>
      <thead><tr><th>#</th><th>S1/S2/S3</th><th>Machine</th><th>Time</th><th>val_loss</th><th>Notes</th></tr></thead>
      <tbody>
        <tr><td>17</td><td class="mono">36/31/33</td><td>b200-0</td><td class="mono">57.43s</td><td class="mono">3.283</td><td>Narrow miss</td></tr>
        <tr><td>18</td><td class="mono">36/31/33</td><td>b200-2</td><td class="mono">57.48s</td><td class="mono">3.279</td><td><span class="check">✓</span></td></tr>
        <tr><td>19</td><td class="mono">35/32/33</td><td>b200-0</td><td class="mono">57.67s</td><td class="mono">3.278</td><td><span class="check">✓</span></td></tr>
        <tr class="miss-row"><td>20</td><td class="mono">35/32/33</td><td>b200-1</td><td class="mono">57.68s</td><td class="mono">3.281</td><td><span class="cross">✗</span></td></tr>
      </tbody>
    </table>
  </div>
  <h3>Round 6: Final Confirmation</h3>
  <div class="table-wrap">
    <table>
      <thead><tr><th>#</th><th>S1/S2/S3</th><th>Machine</th><th>Time</th><th>val_loss</th><th>Notes</th></tr></thead>
      <tbody>
        <tr class="record-row"><td>21</td><td class="mono">36/31/33</td><td>b200-3</td><td class="mono"><strong>57.382s</strong></td><td class="mono">3.2798</td><td><span class="check">✓ NEW RECORD</span></td></tr>
        <tr><td>22</td><td class="mono">36/31/33</td><td>b200-2</td><td class="mono">57.476s</td><td class="mono">3.2794</td><td><span class="check">✓</span></td></tr>
        <tr><td>23</td><td class="mono">35/32/33</td><td>b200-2</td><td class="mono">57.794s</td><td class="mono">3.2799</td><td><span class="check">✓</span></td></tr>
        <tr class="miss-row"><td>24</td><td class="mono">36/31/33</td><td>b200-0</td><td class="mono">57.433s</td><td class="mono">3.2826</td><td><span class="cross">✗</span> <span style="color:var(--text-dim)">(0.003 over)</span></td></tr>
      </tbody>
    </table>
  </div>
  <p><strong>Conclusion:</strong> Variant T (36/31/33) is fastest (~57.4s mean); Variant S (35/32/33) is slightly slower (~57.7s mean) but converges slightly more reliably. Variant T was selected for the record.</p>
</section>
<!-- Reproduction -->
<section>
  <h2><span class="icon">📦</span> Reproduction</h2>
  <h3>Prerequisites</h3>
  <ul>
    <li><strong>8× NVIDIA B200 GPUs</strong> (or 8×H100 – see hardware note below)</li>
    <li>Docker with NVIDIA runtime</li>
    <li>FineWeb10B dataset at <code>/mnt/data/modded-nanogpt/data/fineweb10B/</code></li>
    <li>HuggingFace token for dataset download</li>
  </ul>
  <h3>Docker Image</h3>
  <div class="table-wrap">
    <table>
      <thead><tr><th>Component</th><th>Version</th></tr></thead>
      <tbody>
        <tr><td>Image</td><td class="mono">modded-nanogpt-b200:fa4-fixed</td></tr>
        <tr><td>Image ID</td><td class="mono">e235f3cf1d55</td></tr>
        <tr><td>PyTorch</td><td class="mono">2.9.1+cu128</td></tr>
        <tr><td>flash-attn-4</td><td class="mono">4.0.0b4</td></tr>
        <tr><td>triton</td><td class="mono">3.5.1</td></tr>
        <tr><td>CUDA</td><td class="mono">12.8</td></tr>
        <tr><td>Driver</td><td class="mono">570.211.01</td></tr>
        <tr><td>Python</td><td class="mono">3.12</td></tr>
      </tbody>
    </table>
  </div>
  <h3>Run Command</h3>
  <pre>docker run --gpus all --ipc=host --net=host \
  -v /mnt/data/modded-nanogpt:/workspace \
  -w /workspace \
  -e HF_TOKEN=$HF_TOKEN \
  modded-nanogpt-b200:fa4-fixed \
  torchrun --nproc_per_node=8 train_gpt_volta.py</pre>
  <p>No CLI flags needed – all configuration is embedded in the training script.</p>
  <h3>File Checksums (MD5)</h3>
  <pre>0e05688a3549f36c055fd61c4a3682ab  train_gpt_volta.py  <span class="diff-comment">(Variant T: 36/31/33)</span>
7fc8edea3ea953ab13af0dc3d86ecc55  triton_kernels.py
330b0c49e0180fd26b8909deb7807ff8  fa4_compile_wrapper.py</pre>
  <h3>GCS Artifacts</h3>
  <pre>gs://volta-artifacts/benchmarks/modded-nanogpt/record-break-20260315/
├── PAGEDROP.md
└── record_break/
    ├── record77_varT.py        <span class="diff-comment"># Winning variant T training script</span>
    ├── record77_varS.py        <span class="diff-comment"># Variant S training script</span>
    ├── triton_kernels.py       <span class="diff-comment"># Triton kernel implementations</span>
    ├── fa4_compile_wrapper.py  <span class="diff-comment"># Flash Attention 4 (B200-specific)</span>
    └── RESULTS.md              <span class="diff-comment"># Raw results</span></pre>
</section>
<!-- Hardware Note -->
<section>
  <h2><span class="icon">🖥️</span> Hardware Agnosticism</h2>
  <p>The <strong>stage duration shift is hardware-agnostic</strong> – it works on any GPU that can run the baseline modded-nanogpt speedrun. The optimization is purely about redistributing training steps across stages, which is independent of GPU architecture.</p>
  <p>The only B200-specific component is <code>fa4_compile_wrapper.py</code> (Flash Attention 4). For other GPUs:</p>
  <ul>
    <li><strong>H100:</strong> Replace the FA4 import with the existing FA3 <code>get_kernel('varunneal/flash-attention-3')</code> call from Record #77's <code>train_gpt.py</code></li>
    <li><strong>A100/4090:</strong> Use the standard attention implementation from the upstream repo</li>
  </ul>
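  <p>One portable way to express that substitution is a small backend-selection shim (a sketch with placeholder loader names; only the FA3 kernel reference above comes from Record #77, and the fallback order is an assumption):</p>

```python
# Sketch: pick the first attention backend whose loader succeeds.
# On B200 the first loader would wrap fa4_compile_wrapper; on H100,
# the FA3 kernel; otherwise a plain SDPA fallback. The loader names
# here are placeholders, not real import paths.
def pick_backend(loaders):
    """loaders: list of (name, zero-arg callable that may raise ImportError)."""
    for name, load in loaders:
        try:
            return name, load()
        except ImportError:
            continue
    raise RuntimeError("no attention backend available")
```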
</section>
<!-- What Changed -->
<section>
  <h2><span class="icon">📝</span> What Changed vs What Did NOT Change</h2>
  <h3>Changed (training script)</h3>
  <ol>
    <li>Stage durations: <code>1/3, 1/3, 1/3</code> → <code>0.36, 0.31, 0.33</code> (the 3-line change)</li>
    <li>Flash Attention import: FA3 → FA4 (B200-specific, not part of the duration optimization)</li>
  </ol>
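  <p>The duration fractions have to become whole step counts per stage. One way to do that without drifting from the scheduled total is to round the cumulative stage boundaries (a sketch of the idea only; how <code>TrainingStage</code> actually resolves fractions, and how the 40 extension steps are attached, may differ in the script):</p>

```python
# Convert stage duration fractions into integer step counts by rounding
# cumulative boundaries, so the counts always sum exactly to total_steps.
def stage_steps(fractions, total_steps):
    bounds, acc = [0], 0.0
    for frac in fractions:
        acc += frac
        bounds.append(round(acc * total_steps))
    bounds[-1] = total_steps  # guard against fractions not summing to 1.0
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]
```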
  <h3>NOT Changed</h3>
  <ul>
    <li>Model architecture (parallel 2-lane residual, skip connections, paired head attention, hyperconnect)</li>
    <li>Optimizer (NorMuon+Adam, Muon LR=0.023, Adam LR=0.008)</li>
    <li>Total step count (1490)</li>
    <li>Cooldown fraction (0.60)</li>
    <li>Batch sizes per stage (8, 16, 24)</li>
    <li>Sequence lengths per stage (896, 2048, 2048)</li>
    <li>Learning rate multipliers (1.0, 1.52, 1.73)</li>
    <li>MTP weights</li>
    <li>Triton kernels</li>
    <li>Data loading pipeline</li>
    <li>Anything else</li>
  </ul>
</section>
</div>
<!-- Footer -->
<div class="footer">
  <div class="brand">⚡ Voltropy PBC</div>
  <p>Record set by Kurtz on 2026-03-15</p>
  <p>Optimization discovered through systematic binary search of the stage duration parameter space<br>across 24 experiments on 4× volta-b200 machines (8×B200 each)</p>
</div>
<script src="https://pagedrop.ai/g/jalehman/3c031225cb70b73fe080f60f1b174cce"></script>
</body>
</html>