Date: June 2026
Course: LSE_AI 201: Going from Vision to Value_P1_2026
Model: magistral-small-latest (Mistral AI, EU-hosted)
Method: Overlapping chunking (11,000 char window, 10,000 char step, 1,000 char overlap) → per-chunk summary → concatenated summary
Course 201 has 94 canvas pages and 12 video transcript sidecars. The transcripts range from 2,840 chars (single short video) to 56,878 chars (5.5-hour panel discussion). Most course pages are 2–15KB of markdown. Pages that link to video transcripts inline them before chunking so the page summary covers both the page prose and the video content.
magistral-small was selected from the benchmark at https://gist.github.com/simbo1905/6db319fe5e1b5264ea605209493a7543 as the top-quality EU-hosted model (A-grade, 83% win rate in blind pairwise comparison). It runs on Mistral AI's EU infrastructure — relevant for GDPR/EU AI Act compliance. Cost is negligible: under $0.01 for the full course.
Every summary file is named identically to its source, with `_summary` inserted before `.md`:
PAGE_SLUG.md → PAGE_SLUG_summary.md
PAGE_SLUG_video_TITLE.md → PAGE_SLUG_video_TITLE_summary.md
Summaries live beside their sources in the same module directory. No separate summaries folder. This makes navigation trivial and keeps the git diff readable.
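The naming rule can be sketched in a few lines of Python. This is an illustration only: `summary_path` is a hypothetical helper name, not necessarily what `scripts/summarize_201.py` uses.

```python
from pathlib import Path


def summary_path(source: Path) -> Path:
    """Return the sibling path with `_summary` inserted before the suffix.

    Hypothetical helper illustrating the convention; the real script may
    build the name differently.
    """
    # with_name() keeps the parent directory, so the summary lands
    # beside its source in the same module directory.
    return source.with_name(f"{source.stem}_summary{source.suffix}")


if __name__ == "__main__":
    print(summary_path(Path("PAGE_SLUG.md")))
    print(summary_path(Path("PAGE_SLUG_video_TITLE.md")))
```

Because only the file name changes, the source and its summary sort next to each other in a directory listing.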
Full convention documented in: CANVAS-CRAWL-RUNBOOK-07-summarise.md
`scripts/summarize_201.py` takes `--transcript` or `--page`, reads `MISTRAL_API_KEY` from the environment or `.env`, chunks the source, calls magistral-small per chunk, strips thinking blocks, concatenates the chunk summaries with `---` separators, and writes `SOURCE_summary.md` beside the source.
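The post-processing steps (strip thinking blocks, then join with `---`) might look roughly like the sketch below. It assumes the model's reasoning arrives inline as `<think>…</think>` spans in the returned text; the writeup does not state the actual format, and the function names are hypothetical. The API call itself is left out.

```python
import re

# ASSUMPTION: thinking is delimited by <think>...</think> in the reply text.
# Adjust the pattern if the API returns reasoning in another form.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove assumed <think> spans and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def join_summaries(chunk_summaries: list[str]) -> str:
    """Clean each per-chunk summary and join them with --- separators."""
    cleaned = [strip_thinking(s) for s in chunk_summaries]
    # Skip chunks that were empty after stripping so no stray separators appear.
    return "\n\n---\n\n".join(s for s in cleaned if s)


if __name__ == "__main__":
    print(join_summaries(["<think>plan the summary</think>First part.", "Second part."]))
```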
The 6 items were chosen to cover all size brackets and types.
| # | Item | Type | Raw chars | Summary chars | Ratio | Chunks | Result |
|---|---|---|---|---|---|---|---|
| 1 | 112_video_how-ai-investments (large transcript) | transcript | 56,714 | 18,972 | 33% | 6 | ✅ |
| 2 | 211_video_how-ai-investments (same source, different location) | transcript | 56,714 | 15,790 | 28% | 6 | ✅ |
| 3 | 421_video_bacardi-interview-parts-1-5 (medium transcript) | transcript | 26,161 | 7,243 | 28% | 3 | ✅ |
| 4 | 211_video_zoomo-scenario-introduction (small transcript) | transcript | 6,422 | 2,220 | 35% | 1 | ✅ |
| 5 | page 1.0/111 (small page, no transcript) | page | 10,149 | 2,026 | 20% | 1 | ✅ |
| 6 | page 2.0/211 (large page + 3 inlined transcripts) | page | 77,761 | 19,130 | 25% | 8 | ✅ |
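Bug found and fixed during validation: Pages whose length falls between 10,000 and 11,000 chars (for example item 5, at 10,149 chars) produced a degenerate second chunk of under 1,000 chars. The first 11,000-char window already covered the whole page, but the 10,000-char step still started a second chunk containing only the overlap stub, and the model hallucinated content for that stub. Fixed by dropping any trailing chunk shorter than the overlap (1,000 chars). The fix was verified by re-running the affected samples.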
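A minimal sketch of the chunker with the fix applied, using the window and step from the Method line. The function name and structure are assumptions for illustration, not the script's actual code.

```python
WINDOW = 11_000  # chars per chunk
STEP = 10_000    # chars between chunk starts
# Overlap is WINDOW - STEP = 1,000 chars.


def chunk_text(text: str, window: int = WINDOW, step: int = STEP) -> list[str]:
    """Split text into overlapping chunks, dropping a degenerate tail."""
    overlap = window - step
    chunks = [text[start:start + window] for start in range(0, len(text), step)]
    # A trailing chunk shorter than the overlap lies entirely inside the
    # previous chunk, so it adds nothing and only invites hallucination.
    if len(chunks) > 1 and len(chunks[-1]) < overlap:
        chunks.pop()
    return chunks


if __name__ == "__main__":
    # Lengths taken from the validation table above.
    for n in (6_422, 10_149, 26_161, 56_714, 77_761):
        print(n, len(chunk_text("x" * n)))
```

Run against the raw lengths in the validation table, this yields the chunk counts listed there (1, 1, 3, 6 and 8); without the `pop()`, the 10,149-char page would produce 2 chunks.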
| # | File | Raw chars | Summary chars | Ratio | Chunks |
|---|---|---|---|---|---|
| 1 | 112_video_how-ai-investments-influence-business-process-performance | 56,714 | 16,947 | 30% | 6 |
| 2 | 211_video_how-ai-investments-influence-business-process-performance | 56,714 | 18,112 | 32% | 6 |
| 3 | 211_video_the-importance-of-ai-readiness | 3,908 | 1,012 | 26% | 1 |
| 4 | 211_video_zoomo-scenario-introduction | 6,422 | 2,220 | 35% | 1 |
| 5 | 221_video_why-an-ai-readiness-assessment-part-1 | 6,613 | 3,654 | 55% | 1 |
| 6 | 221_video_why-an-ai-readiness-assessment-part-2 | 6,393 | 1,845 | 29% | 1 |
| 7 | 311_video_crafting-an-ai-business-case | 5,639 | 1,181 | 21% | 1 |
| 8 | 324_video_aligning-ai-with-strategy | 5,845 | 2,594 | 44% | 1 |
| 9 | 411_video_what-success-looks-like | 5,294 | 1,943 | 37% | 1 |
| 10 | 421_video_bacardi-interview-parts-1-5 | 26,161 | 7,243 | 28% | 3 |
| 11 | 427_video_types-of-risks-in-ai-projects | 3,413 | 2,019 | 59% | 1 |
| 12 | 429_video_risk-mitigation-strategies | 2,840 | 1,299 | 46% | 1 |
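The Ratio column is the summary length as a percentage of the raw length, rounded to a whole number. A small sketch to reproduce it (the function name is hypothetical):

```python
def compression_ratio(raw_chars: int, summary_chars: int) -> int:
    """Summary size as a whole-number percentage of the raw size."""
    return round(100 * summary_chars / raw_chars)


if __name__ == "__main__":
    # A few rows from the transcript table above: (raw chars, summary chars).
    rows = [(56_714, 16_947), (6_613, 3_654), (3_413, 2_019), (5_639, 1_181)]
    for raw, summary in rows:
        print(f"{raw:>6} -> {summary:>6}  {compression_ratio(raw, summary)}%")
```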
All 94 pages summarised in module order: 0.0 → 1.0 → 2.0 → 3.0 → 4.0 → 5.0 → 6.0 → misc. Pages with linked transcript sidecars had them inlined before chunking, so page summaries cover both prose content and video material in a single summary file.
Total output: 106 summary files (12 transcript + 94 page)
| Source type | Typical raw | Typical summary | Ratio |
|---|---|---|---|
| Large transcript (>50KB) | 57,000 | 16,000–19,000 | 28–33% |
| Medium transcript (20–30KB) | 26,000 | 8,000–10,000 | 30–35% |
| Small transcript (<7KB) | 2,000–7,000 | 800–2,500 | 30–40% |
| Small page (no transcript, <5KB) | 1,000–5,000 | 400–1,500 | 25–40% |
| Large page + inlined transcripts | 10,000–80,000 | 3,000–20,000 | 25–35% |
- Starts with `# Summary: TITLE`
- Contains substantive content, not just headings
- No mid-sentence truncation
- Compression ratio within expected range
- No hallucinated facts or invented structure
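The mechanical items on this checklist could be screened automatically before a human reads for hallucinations. The sketch below is an assumption about how that might be done, not part of the described pipeline: `check_summary` is a hypothetical helper, the truncation test is a crude punctuation heuristic, and the default ratio bounds are placeholders rather than values from this writeup.

```python
def check_summary(summary: str, raw_chars: int,
                  min_ratio: float = 0.15, max_ratio: float = 0.65) -> list[str]:
    """Return a list of checklist problems; an empty list means none found.

    The ratio bounds are placeholder values; tune them per source type.
    Hallucination checks still need a human reviewer.
    """
    problems = []
    lines = [ln.strip() for ln in summary.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("# Summary:"):
        problems.append("does not start with '# Summary: TITLE'")
    # Body = everything that is neither a heading nor a --- separator.
    body = [ln for ln in lines if not ln.startswith("#") and ln != "---"]
    if not body:
        problems.append("no substantive content beyond headings")
    elif not body[-1].endswith((".", "!", "?", ")", '"', "*", "|")):
        # Heuristic only: a last line without closing punctuation may be cut off.
        problems.append("possible mid-sentence truncation")
    if raw_chars > 0:
        ratio = len(summary) / raw_chars
        if not min_ratio <= ratio <= max_ratio:
            problems.append(f"compression ratio {ratio:.0%} outside expected range")
    return problems


if __name__ == "__main__":
    print(check_summary("# Summary: Demo\n\nThe video covers risk types.", 200))
    print(check_summary("# Summary: Demo\n\nThe video covers risk", 200))
```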
After all 94 pages + 12 videos are done:
- 106 `_summary.md` files committed beside their sources
- Convention documented in `CANVAS-CRAWL-RUNBOOK-07-summarise.md`
- Script at `scripts/summarize_201.py` (also pushed to gist for reuse)
- This writeup posted as a public gist
Updated in real time as each item completes.