@belisarius222
Created April 4, 2026 03:55
<!DOCTYPE html>
<html lang="en" data-theme="dark">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Staged-Polymorphic Omega System — Training Report</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@picocss/pico@2/css/pico.min.css">
<style>
:root {
--pico-font-size: 16px;
}
body { padding-bottom: 4rem; }
h1 { margin-bottom: 0.25em; }
.subtitle { color: var(--pico-muted-color); margin-bottom: 2rem; font-size: 1.1rem; }
.result-card {
border-left: 4px solid var(--pico-primary);
padding: 1rem 1.5rem;
margin: 1.5rem 0;
background: var(--pico-card-background-color);
border-radius: 0 var(--pico-border-radius) var(--pico-border-radius) 0;
}
.result-card.success { border-left-color: #22c55e; }
.result-card.info { border-left-color: var(--pico-primary); }
.beat { color: #22c55e; font-weight: bold; }
.miss { color: #f59e0b; }
table { font-variant-numeric: tabular-nums; }
.operator-grid { display: grid; grid-template-columns: 1fr; gap: 1.5rem; margin: 1.5rem 0; }
@media (min-width: 768px) { .operator-grid { grid-template-columns: 1fr 1fr; } }
.op-card {
background: var(--pico-card-background-color);
border-radius: var(--pico-border-radius);
padding: 1.25rem;
border: 1px solid var(--pico-muted-border-color);
}
.op-card h3 { margin-top: 0; margin-bottom: 0.5rem; }
.op-num {
display: inline-block;
background: var(--pico-primary);
color: var(--pico-primary-inverse);
width: 1.6em; height: 1.6em;
text-align: center; line-height: 1.6em;
border-radius: 50%; font-weight: bold;
margin-right: 0.4em; font-size: 0.9em;
}
.analogy { font-style: italic; color: var(--pico-muted-color); margin-top: 0.75rem; }
.chain-diagram {
background: var(--pico-card-background-color);
border-radius: var(--pico-border-radius);
padding: 1.5rem;
font-family: monospace;
font-size: 0.9rem;
line-height: 1.8;
overflow-x: auto;
white-space: pre;
border: 1px solid var(--pico-muted-border-color);
}
details summary { cursor: pointer; font-weight: 600; }
details[open] summary { margin-bottom: 0.75rem; }
hr { margin: 2.5rem 0; }
.tag {
display: inline-block;
background: var(--pico-primary);
color: var(--pico-primary-inverse);
padding: 0.15em 0.5em;
border-radius: 4px;
font-size: 0.8rem;
font-weight: 600;
}
.tag.green { background: #22c55e; }
</style>
</head>
<body>
<main class="container">
<h1>Staged-Polymorphic Omega System</h1>
<p class="subtitle">Training Report &mdash; April 3, 2026</p>
<!-- ==================== STATUS ==================== -->
<section>
<h2>Status</h2>
<div class="result-card success">
<strong>The learned model now beats the teacher on all 3 seeds with correct runtime semantics.</strong>
</div>
<h3>Key Fix: Init Supervision in Joint Stage</h3>
<p>
The learned model's init predictions were <strong>2&times; worse</strong> than the teacher's because of a
train/eval mask mismatch. Adding <code>init_z_loss</code> (0.5 weight) to the joint loss broke the
chicken-and-egg cycle where the evaluator starved init of gradient.
</p>
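<p>
A minimal sketch of the fix, assuming hypothetical names (<code>eval_loss</code>, <code>init_z_pred</code>, <code>teacher_init_z</code>) &mdash; the real loss lives in <code>omega_train.py</code> and its exact terms are not shown here:
</p>

```python
import numpy as np

INIT_Z_WEIGHT = 0.5  # the init_z_loss weight reported above

def joint_loss(eval_loss, init_z_pred, teacher_init_z):
    """Joint objective with direct init supervision added.

    eval_loss:      the pre-existing joint-stage loss (a float here)
    init_z_pred:    the student's init z for the task
    teacher_init_z: the teacher's init z, used as a regression target
    """
    init_z_loss = float(np.mean((init_z_pred - teacher_init_z) ** 2))
    return eval_loss + INIT_Z_WEIGHT * init_z_loss
```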
<h3>Results <small>(32 eval campaigns each)</small></h3>
<table>
<thead>
<tr><th>Seed</th><th>Joint Epochs</th><th>Teacher</th><th>Learned</th><th>Beats Teacher?</th></tr>
</thead>
<tbody>
<tr><td>13</td><td>8</td><td>0.688</td><td><strong>0.639</strong></td><td><span class="beat">YES (&minus;7.1%)</span></td></tr>
<tr><td>14</td><td>12</td><td>0.686</td><td><strong>0.678</strong></td><td><span class="beat">YES (&minus;1.2%)</span></td></tr>
<tr><td>15</td><td>8</td><td>0.706</td><td><strong>0.630</strong></td><td><span class="beat">YES (&minus;10.8%)</span></td></tr>
</tbody>
</table>
<h3>Git State</h3>
<ul>
<li>On <code>main</code> at <code>f381bc6</code>, pushed to origin</li>
<li>Clean working tree (no uncommitted changes)</li>
<li>Default <code>--joint-epochs</code> is still 4 in <code>omega_train.py</code></li>
</ul>
<h3>What's Left</h3>
<ul>
<li>Bump default joint epochs to 12 (pending confirmation)</li>
<li>Could also add generate/update z supervision to joint stage for further gains</li>
<li>The <code>--teacher-eval-in-joint</code> flag exists but hurts &mdash; could remove to clean up</li>
</ul>
</section>
<hr>
<!-- ==================== OVERVIEW ==================== -->
<section>
<h2>What This System Does</h2>
<h3>The Problem</h3>
<p>
Imagine you have a <strong>teacher</strong> &mdash; a hand-coded algorithm that solves math problems
(specifically, linear regression tasks). The teacher is pretty good. It looks at a problem, picks a
strategy, solves it, and then learns from the experience to do better on the next problem.
</p>
<p>
We want to build a <strong>student</strong> (a neural network) that watches the teacher work, learns to
imitate it, and eventually does <em>better</em> than the teacher.
</p>
<h3>The Teacher's Job</h3>
<p>
The teacher solves problems in <strong>campaigns</strong> &mdash; sequences of 16 related tasks. For each task, it:
</p>
<ol>
<li><strong>Residualize</strong> &mdash; Figures out which dimensions of the problem matter</li>
<li><strong>Init</strong> &mdash; Makes a quick first guess at the answer</li>
<li><strong>Update</strong> &mdash; Iteratively refines the answer (slow but reliable)</li>
<li><strong>Generate</strong> &mdash; Tries to jump directly to a good answer (fast but risky)</li>
<li><strong>Evaluate</strong> &mdash; Picks the best of init/generate/update</li>
<li><strong>Promote</strong> &mdash; Decides whether to update its long-term memory</li>
<li><strong>Reflect</strong> &mdash; Adjusts its own strategy knobs for next time</li>
</ol>
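<p>
The seven steps above can be sketched as runnable Python under heavy simplification &mdash; every operator body here is an invented stand-in (no learning, no memory, no spawning), purely to show how the pieces feed each other:
</p>

```python
import numpy as np

def solve_task(X, y, base_theta, basis):
    """Toy walk through the teacher's per-task loop; all bodies are stand-ins."""
    # 1. Residualize: pretend every basis vector is relevant.
    mask = np.ones(basis.shape[1], dtype=bool)
    B = basis[:, mask]
    # 2. Init: least-squares guess for z in the reduced subspace.
    z_init, *_ = np.linalg.lstsq(X @ B, y - X @ base_theta, rcond=None)
    # 3. Update: a few plain gradient steps on the support loss.
    z_upd = z_init.copy()
    for _ in range(3):
        grad = (X @ B).T @ (X @ base_theta + X @ B @ z_upd - y) / len(y)
        z_upd = z_upd - 0.1 * grad
    # 4. Generate: one-shot jump (here just a copy of init's answer).
    z_gen = z_init.copy()
    # 5. Evaluate: keep the candidate with the lowest support loss.
    cands = [z_init, z_gen, z_upd]
    losses = [float(np.mean((X @ (base_theta + B @ c) - y) ** 2)) for c in cands]
    best = cands[int(np.argmin(losses))]
    # 6-7. Promote/Reflect would write memory and tune knobs; omitted here.
    return base_theta + B @ best
```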
<p>
Across a campaign, the teacher builds up <strong>family memory</strong> (patterns it recognizes) and can
even <strong>spawn child models</strong> (specialized solvers for problem types it sees repeatedly).
</p>
<h3>The Training Process</h3>
<p>
We generate thousands of campaigns where the teacher solves problems, recording everything: what it saw,
what it decided, and how well it did. This is the <strong>corpus</strong> &mdash; about 32,000 solved tasks.
</p>
<p>Then we train the neural network in <strong>stages</strong>:</p>
<ol>
<li><strong>Stages 1&ndash;5:</strong> Train each operator separately. The network learns to predict the
teacher's outputs individually.</li>
<li><strong>Stage 6 (Joint):</strong> Wire everything together. The network runs its own full pipeline &mdash;
its init feeds into its generate, its evaluator picks from its own candidates. This is where the operators
learn to work <em>together</em>, not just individually.</li>
</ol>
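<p>
As a sketch, the staging can be expressed as a schedule. Stage names follow the report; the per-operator epoch count is an invented placeholder, and <code>joint_epochs=12</code> is the value the report recommends rather than the current default of 4:
</p>

```python
def training_schedule(joint_epochs=12, op_epochs=4):
    """Return the ordered (stage, epochs) list the staged trainer would run."""
    ops = ["residualize", "init", "update", "generate", "evaluate_promote"]
    schedule = [(op, op_epochs) for op in ops]  # stages 1-5: operators in isolation
    schedule.append(("joint", joint_epochs))    # stage 6: full self-fed pipeline
    return schedule
```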
<h3>Why "Beats Teacher" Is Hard</h3>
<p>
At evaluation time, the neural network runs autonomously &mdash; no teacher guidance. It makes decisions,
updates its memory, and those decisions affect future tasks in the campaign.
<strong>Errors compound:</strong> a bad promote decision corrupts the memory, which leads to a bad
base_theta for the next task, which leads to worse init/generate/update, and so on.
</p>
<p>
The teacher has perfect rule-based logic. The student has to approximate all of it with learned weights.
Getting the student to <em>exceed</em> the teacher means the student found strategies the hand-coded
rules missed.
</p>
<div class="result-card info">
<strong>The Breakthrough:</strong> The student's init predictions were terrible at runtime &mdash; 2&times;
worse than the teacher's. The evaluator learned "init is bad, avoid it," which starved init of gradient.
Adding direct init supervision broke the cycle &mdash; init improved, the evaluator started selecting it,
and the whole pipeline got better. Result: <strong>1&ndash;11% improvement over the teacher</strong> across
all seeds.
</div>
</section>
<hr>
<!-- ==================== SIX OPERATORS ==================== -->
<section>
<h2>The Six Operators in Detail</h2>
<p>
The system solves <strong>linear regression tasks</strong>: given input-output pairs (support data), find a
parameter vector <code>theta</code> (8 dimensions) such that <code>y &asymp; X @ theta</code>. Held-out
"val" data measures solution quality.
</p>
<p>
Each task belongs to a <strong>family</strong> (like "in_basis_easy", "off_basis", "mixed") describing how
the true answer relates to a shared low-dimensional subspace (the "adapter basis" &mdash; 6 basis vectors
in 8-d space).
</p>
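<p>
A toy generator for this setup, assuming invented sampling details (the family names follow the report, but how each family is actually drawn is not documented here):
</p>

```python
import numpy as np

def make_task(family, rng, n_support=32, d=8, k=6):
    """Toy task: 8-d theta, shared 6-vector adapter basis, family-dependent theta."""
    basis = np.linalg.qr(rng.standard_normal((d, k)))[0]  # orthonormal 6-d subspace
    z = rng.standard_normal(k)
    if family == "in_basis_easy":
        theta = basis @ z                              # fully inside the subspace
    elif family == "off_basis":
        theta = rng.standard_normal(d)                 # ignores the subspace
    else:  # "mixed"
        theta = basis @ z + 0.3 * rng.standard_normal(d)
    X = rng.standard_normal((n_support, d))
    y = X @ theta                                      # noiseless support pairs
    return X, y, theta, basis
```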
<div class="operator-grid">
<div class="op-card">
<h3><span class="op-num">1</span> Residualize</h3>
<p>
Figures out <strong>which parts of the subspace matter</strong> for this task. The system has 6 basis
vectors; not all are relevant every time. Outputs a <strong>mask</strong> (which vectors to use) and a
<strong>rank</strong> (how many to activate).
</p>
<p>The output <code>ResidualSpec</code> is used by all downstream operators &mdash; everything else works
in this reduced subspace.</p>
<p class="analogy">Like a photographer choosing which lenses to mount before taking a shot.</p>
</div>
<div class="op-card">
<h3><span class="op-num">2</span> Init</h3>
<p>
Produces a <strong>quick first guess</strong> at the solution in a single forward pass. Given the task
embedding, memory state, base_theta, and the mask, it predicts a <code>z</code> vector &mdash;
coordinates in the adapter subspace. Final theta = <code>base_theta + basis @ z</code>.
</p>
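<p>
The decode step in the formula above, with arbitrary toy values (the real basis is shared and learned, not axis-aligned):
</p>

```python
import numpy as np

base_theta = np.zeros(8)
basis = np.eye(8)[:, :6]        # toy basis: first 6 coordinate axes
z = np.arange(6, dtype=float)   # init's predicted subspace coordinates
theta = base_theta + basis @ z  # final 8-d solution
```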
<p>Cheap but limited &mdash; only as good as a learned function of the inputs allows.</p>
<p class="analogy">Like pattern matching: "given what this problem looks like, here's roughly the answer."</p>
</div>
<div class="op-card">
<h3><span class="op-num">3</span> Update</h3>
<p>
<strong>Iteratively refines</strong> init's guess using gradient descent on support data. Starting from
init's z, it runs 2&ndash;4 learned gradient steps. Each step computes the gradient of support loss
w.r.t. z and applies a <em>learned</em> update rule (not raw gradient descent).
</p>
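<p>
A sketch of that loop, assuming plain gradient descent with a fixed step size in place of the learned update rule (the names and step count are illustrative):
</p>

```python
import numpy as np

def update_z(z, X, y, base_theta, basis, steps=3, lr=0.1):
    """Refine z by gradient steps on the support loss; the real rule is learned."""
    B = X @ basis
    for _ in range(steps):
        resid = X @ base_theta + B @ z - y
        grad = B.T @ resid / len(y)  # d(support MSE)/dz, up to a factor of 2
        z = z - lr * grad            # a learned rule would transform grad here
    return z
```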
<p>Slower than init (multiple steps) but more reliable &mdash; directly optimizes on the data.</p>
<p class="analogy">Like practice: actually working through the problem step by step.</p>
</div>
<div class="op-card">
<h3><span class="op-num">4</span> Generate</h3>
<p>
Tries to <strong>jump directly to a good solution</strong> in one shot, taking init's z as input and
predicting a better z. It has learned shortcuts from many (task, init_z, optimal_z) examples during
training.
</p>
<p>High-risk, high-reward: great answers with almost no compute when it works, worse than init when it
fails. That's why the evaluator exists.</p>
<p class="analogy">Like intuition: "when the first guess looks like this, the real answer is usually over there."</p>
</div>
<div class="op-card">
<h3><span class="op-num">5</span> Evaluate + Promote</h3>
<p>
<strong>Evaluate</strong> picks the best candidate from init, generate, and update using a neural
network that sees each candidate's losses and parameter norms.
</p>
<p>
<strong>Promote</strong> then decides: how much to blend the winner into the family prototype
(<code>prototype_alpha</code>), whether to update the global prior (<code>slow_gate</code>), and
whether to spawn a specialized child model (<code>spawn_gate</code>).
</p>
<p>This is how the system builds <strong>long-term memory</strong> across a campaign.</p>
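<p>
A hand-rolled stand-in for the pair &mdash; the real versions are neural networks; this one picks the minimum-loss candidate and blends it into the family prototype with a fixed <code>prototype_alpha</code>, and it ignores <code>slow_gate</code> and <code>spawn_gate</code> entirely:
</p>

```python
import numpy as np

def evaluate_and_promote(candidates, val_losses, prototype, alpha=0.2):
    """Evaluate: choose the winner. Promote: EMA-blend it into the prototype."""
    best = int(np.argmin(val_losses))                     # evaluate
    winner = candidates[best]
    prototype = (1 - alpha) * prototype + alpha * winner  # promote (blend only)
    return best, prototype
```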
</div>
<div class="op-card">
<h3><span class="op-num">6</span> Reflect</h3>
<p>
<strong>Adjusts the system's own hyperparameters</strong> based on recent performance: learning rate,
number of update steps, promote aggressiveness, generate acceptance threshold, reflection window size,
and spawn thresholds.
</p>
<p>
This is <strong>meta-learning</strong>: the system tunes itself over a campaign. Early on it may be
exploratory; later, as it accumulates knowledge, it becomes more conservative.
</p>
</div>
</div>
<h3>How They Chain Together</h3>
<div class="chain-diagram">residualize &rarr; init &rarr; [update, generate] &rarr; evaluate &rarr; promote &rarr; reflect
&darr;
updates memory &amp; policy
&darr;
next task uses updated state</div>
<p style="margin-top: 1rem;">
<strong>Decisions compound.</strong> A good promote decision improves base_theta for the next task. A good
reflect decision tunes the policy so update takes the right number of steps. A bad decision in any operator
cascades forward through the entire campaign.
</p>
</section>
</main>
</body>
</html>