@simbo1905
Created June 4, 2026 23:20
Summarising LSE AI Leadership Course 2 with magistral-small: methodology, naming convention, chunking parameters, validation process, and bug found

Summarising LSE AI Leadership Course 2 with magistral-small

Date: June 2026
Course: LSE_AI 201: Going from Vision to Value_P1_2026
Model: magistral-small-latest (Mistral AI, EU-hosted)
Method: Overlapping chunking (11,000 char window, 10,000 char step, 1,000 char overlap) → per-chunk summary → concatenated summary


Why this approach

Course 201 has 94 Canvas pages and 12 video transcript sidecars. The transcripts range from 2,840 chars (single short video) to 56,714 chars (5.5-hour panel discussion). Most course pages are 2–15 KB of markdown. When a page links to video transcripts, those transcripts are inlined into the page before chunking, so the page summary covers both the page prose and the video content.

magistral-small was selected from the benchmark at https://gist.github.com/simbo1905/6db319fe5e1b5264ea605209493a7543 as the top-quality EU-hosted model (A-grade, 83% win rate in blind pairwise comparison). It runs on Mistral AI's EU infrastructure — relevant for GDPR/EU AI Act compliance. Cost is negligible: under $0.01 for the full course.


Naming convention

Every summary file is named identically to its source with _summary inserted before .md:

PAGE_SLUG.md                     → PAGE_SLUG_summary.md
PAGE_SLUG_video_TITLE.md         → PAGE_SLUG_video_TITLE_summary.md

Summaries live beside their sources in the same module directory. No separate summaries folder. This makes navigation trivial and keeps the git diff readable.

Full convention documented in: CANVAS-CRAWL-RUNBOOK-07-summarise.md
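
The naming rule above can be sketched as a small path helper. This is an illustration only, not the code in scripts/summarize_201.py; the function name summary_path_for and the guard against re-summarising an existing summary are assumptions.

```python
from pathlib import Path


def summary_path_for(source: Path) -> Path:
    """Return the sibling summary path: NAME.md -> NAME_summary.md."""
    if source.suffix != ".md":
        raise ValueError(f"expected a .md source, got {source.name}")
    # Hypothetical guard: never produce NAME_summary_summary.md.
    if source.stem.endswith("_summary"):
        raise ValueError(f"{source.name} is already a summary file")
    # with_name keeps the parent directory, so the summary lands beside its source.
    return source.with_name(f"{source.stem}_summary.md")
```

Because only the file name changes, the same helper covers both page sources and video transcript sidecars.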


Script

scripts/summarize_201.py — takes --transcript or --page, reads MISTRAL_API_KEY from env or .env, chunks the source, calls magistral-small per chunk, strips thinking blocks, concatenates with --- separators, writes SOURCE_summary.md beside the source.
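
The post-processing steps (strip thinking blocks, concatenate with --- separators) might look roughly like the sketch below. It assumes the model's reasoning arrives inline wrapped in `<think>...</think>` tags; the writeup does not state the actual delimiter, so treat the pattern and both function names as hypothetical.

```python
import re

# Assumption: reasoning is delimited by <think>...</think>. The real format
# depends on how the API returns it and is not stated in this writeup.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove every thinking block and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def join_chunk_summaries(raw_outputs: list[str]) -> str:
    """Clean each per-chunk output, then join them with --- separators."""
    cleaned = [strip_thinking(output) for output in raw_outputs]
    # Skip chunks that are empty after cleaning so no separator is orphaned.
    return "\n\n---\n\n".join(c for c in cleaned if c) + "\n"
```

The non-greedy `.*?` with `re.DOTALL` matters here: a greedy match would swallow everything between the first opening tag and the last closing tag if one output contained two thinking blocks.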


Validation sample (6 items before full run)

The 6 items were chosen to cover all size brackets and types.

| # | Item | Type | Raw chars | Summary chars | Ratio | Chunks | Result |
|---|------|------|-----------|---------------|-------|--------|--------|
| 1 | 112_video_how-ai-investments (large transcript) | transcript | 56,714 | 18,972 | 33% | 6 | |
| 2 | 211_video_how-ai-investments (same source, different location) | transcript | 56,714 | 15,790 | 28% | 6 | |
| 3 | 421_video_bacardi-interview-parts-1-5 (medium transcript) | transcript | 26,161 | 7,243 | 28% | 3 | |
| 4 | 211_video_zoomo-scenario-introduction (small transcript) | transcript | 6,422 | 2,220 | 35% | 1 | |
| 5 | page 1.0/111 (small page, no transcript) | page | 10,149 | 2,026 | 20% | 1 | |
| 6 | page 2.0/211 (large page + 3 inlined transcripts) | page | 77,761 | 19,130 | 25% | 8 | |

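Bug found and fixed during validation: Pages whose length falls between 10,000 and 11,000 chars produced a degenerate 2nd chunk of <1,000 chars (just the overlap stub). For example, a 10,149-char page yields a first chunk covering the whole page and a second chunk of only 149 chars. The model hallucinated content for that stub. Fixed by dropping any trailing chunk shorter than the overlap (1,000 chars); nothing is lost, because the preceding 11,000-char window already contains that text. The fix was verified by re-running the affected samples.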
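
A minimal sketch of the chunker with the stub fix applied, using the window and step values from the Method line. The function name chunk_text is an assumption; the actual script may be structured differently.

```python
WINDOW = 11_000  # chars per chunk
STEP = 10_000    # distance between consecutive chunk starts


def chunk_text(text: str, window: int = WINDOW, step: int = STEP) -> list[str]:
    """Split text into overlapping windows, dropping a degenerate trailing stub."""
    overlap = window - step  # 1,000 chars with the defaults
    chunks = [text[i:i + window] for i in range(0, len(text), step)]
    # A trailing chunk shorter than the overlap lies entirely inside the
    # previous window, so dropping it loses no text.
    if len(chunks) > 1 and len(chunks[-1]) < overlap:
        chunks.pop()
    return chunks
```

With these defaults the sketch reproduces the chunk counts in the validation table: 10,149 chars gives 1 chunk, 26,161 gives 3, 56,714 gives 6, and 77,761 gives 8. Note that the rule as stated is a strict "shorter than"; a trailing chunk of exactly 1,000 chars is also fully covered by the previous window, so `<=` would be a defensible variant.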


Progress log

Video transcripts (12 total — all complete)

| # | File | Raw chars | Summary chars | Ratio | Chunks |
|---|------|-----------|---------------|-------|--------|
| 1 | 112_video_how-ai-investments-influence-business-process-performance | 56,714 | 16,947 | 30% | 6 |
| 2 | 211_video_how-ai-investments-influence-business-process-performance | 56,714 | 18,112 | 32% | 6 |
| 3 | 211_video_the-importance-of-ai-readiness | 3,908 | 1,012 | 26% | 1 |
| 4 | 211_video_zoomo-scenario-introduction | 6,422 | 2,220 | 35% | 1 |
| 5 | 221_video_why-an-ai-readiness-assessment-part-1 | 6,613 | 3,654 | 55% | 1 |
| 6 | 221_video_why-an-ai-readiness-assessment-part-2 | 6,393 | 1,845 | 29% | 1 |
| 7 | 311_video_crafting-an-ai-business-case | 5,639 | 1,181 | 21% | 1 |
| 8 | 324_video_aligning-ai-with-strategy | 5,845 | 2,594 | 44% | 1 |
| 9 | 411_video_what-success-looks-like | 5,294 | 1,943 | 37% | 1 |
| 10 | 421_video_bacardi-interview-parts-1-5 | 26,161 | 7,243 | 28% | 3 |
| 11 | 427_video_types-of-risks-in-ai-projects | 3,413 | 2,019 | 59% | 1 |
| 12 | 429_video_risk-mitigation-strategies | 2,840 | 1,299 | 46% | 1 |

Course pages (94 total — all complete)

All 94 pages summarised in module order: 0.0 → 1.0 → 2.0 → 3.0 → 4.0 → 5.0 → 6.0 → misc. Pages with linked transcript sidecars had them inlined before chunking, so page summaries cover both prose content and video material in a single summary file.

Total output: 106 summary files (12 transcript + 94 page)


Expected size ratios (from benchmark + observed results)

| Source type | Typical raw (chars) | Typical summary (chars) | Ratio |
|-------------|---------------------|-------------------------|-------|
| Large transcript (>50KB) | 57,000 | 16,000–19,000 | 28–33% |
| Medium transcript (20–30KB) | 26,000 | 8,000–10,000 | 30–35% |
| Small transcript (<7KB) | 2,000–7,000 | 800–2,500 | 30–40% |
| Small page (no transcript, <5KB) | 1,000–5,000 | 400–1,500 | 25–40% |
| Large page + inlined transcripts | 10,000–80,000 | 3,000–20,000 | 25–35% |

Quality checks per summary

  • Starts with # Summary: TITLE
  • Contains substantive content — not just headings
  • No mid-sentence truncation
  • Compression ratio within expected range
  • No hallucinated facts or invented structure
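
The first four checks can be partly automated; the sketch below is one possible heuristic, not the procedure actually used for this run. The function name check_summary, the ratio bounds passed in by the caller, and the punctuation-based truncation test are all assumptions. The last check (hallucinated facts or invented structure) still needs a human reading the summary against its source.

```python
def check_summary(source: str, summary: str,
                  min_ratio: float, max_ratio: float) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not summary.lstrip().startswith("# Summary: "):
        problems.append("does not start with '# Summary: TITLE'")
    # Body lines: anything that is not blank, a heading, or a --- separator.
    body = [ln.strip() for ln in summary.splitlines()
            if ln.strip() and not ln.lstrip().startswith("#") and ln.strip() != "---"]
    if not body:
        problems.append("no substantive content, headings only")
    elif body[-1][-1] not in ".!?:)*`\"'":
        # Heuristic only: a final line without closing punctuation may be cut off.
        problems.append("possible mid-sentence truncation")
    ratio = len(summary) / len(source) if source else 0.0
    if not (min_ratio <= ratio <= max_ratio):
        problems.append(f"ratio {ratio:.0%} outside {min_ratio:.0%}-{max_ratio:.0%}")
    return problems
```

The truncation test can flag legitimate summaries that end in an unpunctuated bullet, so its output is better treated as a prompt for manual review than as a hard failure.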

Final artefacts

After all 94 pages + 12 videos are done:

  • 106 _summary.md files committed beside their sources
  • Convention documented in CANVAS-CRAWL-RUNBOOK-07-summarise.md
  • Script at scripts/summarize_201.py (also pushed to gist for reuse)
  • This writeup posted as a public gist

Updated in real time as each item completes.
