Summarising LSE AI Leadership Course 2 with magistral-small

Date: June 2026
Course: LSE_AI 201: Going from Vision to Value_P1_2026
Model: magistral-small-latest (Mistral AI, EU-hosted)
Method: Overlapping chunking (11,000 char window, 10,000 char step, 1,000 char overlap) → per-chunk summary → concatenated summary

Why this approach

Course 201 has 94 Canvas pages and 12 video transcript sidecars. The transcripts range from 2,840 chars (a single short video) to 56,878 chars (a 5.5-hour panel discussion). Most course pages are 2–15KB of markdown. Where a page links to video transcripts, those transcripts are inlined before chunking, so the page summary covers both the page prose and the video content.

magistral-small was selected from the benchmark at https://gist.github.com/simbo1905/6db319fe5e1b5264ea605209493a7543 as the top-quality EU-hosted model (A-grade, 83% win rate in blind pairwise comparison). It runs on Mistral AI's EU infrastructure — relevant for GDPR/EU AI Act compliance. Cost is negligible: under $0.01 for the full course.

Naming convention

Every summary file is named identically to its source with _summary inserted before .md:

PAGE_SLUG.md                     → PAGE_SLUG_summary.md
PAGE_SLUG_video_TITLE.md         → PAGE_SLUG_video_TITLE_summary.md

Summaries live beside their sources in the same module directory. No separate summaries folder. This makes navigation trivial and keeps the git diff readable.
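The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the code in scripts/summarize_201.py; the function name summary_path is hypothetical.

```python
from pathlib import Path


def summary_path(source: Path) -> Path:
    """Return the sibling path with _summary inserted before the .md suffix.

    Hypothetical helper: the real script may derive the name differently.
    """
    if source.suffix != ".md":
        raise ValueError(f"expected a .md source, got {source.name}")
    # with_name keeps the parent directory, so the summary lands beside the source.
    return source.with_name(f"{source.stem}_summary{source.suffix}")
```

For example, summary_path(Path("module/PAGE_SLUG_video_TITLE.md")) gives module/PAGE_SLUG_video_TITLE_summary.md, in the same directory as the source.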

Full convention documented in: CANVAS-CRAWL-RUNBOOK-07-summarise.md

Script

scripts/summarize_201.py takes --transcript or --page, reads MISTRAL_API_KEY from the environment or .env, chunks the source, calls magistral-small once per chunk, strips thinking blocks from each response, concatenates the chunk summaries with --- separators, and writes SOURCE_summary.md beside the source.
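The post-processing steps (strip thinking blocks, then join with --- separators) might look roughly like the sketch below. It assumes the model's reasoning arrives inline as <think>...</think> tags in the response text; that delimiter is an assumption, and the actual script may receive thinking in a different form. The function names are hypothetical.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think> inside the returned text.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove any thinking block and surrounding whitespace from one response."""
    return THINK_RE.sub("", text).strip()


def join_summaries(chunk_summaries: list[str]) -> str:
    """Concatenate cleaned per-chunk summaries with --- separators."""
    cleaned = [strip_thinking(s) for s in chunk_summaries]
    # Skip chunks that are empty after stripping so no dangling separator appears.
    return "\n\n---\n\n".join(s for s in cleaned if s)
```

The non-greedy .*? with re.DOTALL matters here: it lets a thinking block span several lines without swallowing the text between two separate blocks.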

Validation sample (6 items before full run)

The 6 items were chosen to cover all size brackets and types.

#	Item	Type	Raw chars	Summary chars	Ratio	Chunks	Result
1	112_video_how-ai-investments (large transcript)	transcript	56,714	18,972	33%	6	✅
2	211_video_how-ai-investments (same source, different location)	transcript	56,714	15,790	28%	6	✅
3	421_video_bacardi-interview-parts-1-5 (medium transcript)	transcript	26,161	7,243	28%	3	✅
4	211_video_zoomo-scenario-introduction (small transcript)	transcript	6,422	2,220	35%	1	✅
5	page 1.0/111 (small page, no transcript)	page	10,149	2,026	20%	1	✅
6	page 2.0/211 (large page + 3 inlined transcripts)	page	77,761	19,130	25%	8	✅

Bug found and fixed during validation: pages between 10,000 and 11,000 chars long produced a degenerate second chunk of under 1,000 chars (just the overlap stub), and the model hallucinated content for that stub. The fix drops any trailing chunk shorter than the overlap (1,000 chars); because the window is one step plus the overlap, such a chunk contains only text the previous chunk already covered. The fix was verified by re-running the affected samples.
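A minimal sketch of overlapping chunking with the trailing-stub fix, using the window and step sizes from the Method line. The function name chunk_text and its signature are hypothetical; the real script may structure this differently.

```python
def chunk_text(text: str, window: int = 11_000, step: int = 10_000) -> list[str]:
    """Split text into overlapping chunks of up to `window` chars, one every `step` chars."""
    overlap = window - step  # 1,000 chars with the defaults
    chunks = [text[start:start + window] for start in range(0, len(text), step)]
    # A trailing chunk shorter than the overlap repeats only text the previous
    # chunk already covered, so drop it instead of sending a stub to the model.
    if len(chunks) > 1 and len(chunks[-1]) < overlap:
        chunks.pop()
    return chunks


# Chunk counts for the raw sizes in the validation table.
sizes = (56_714, 26_161, 77_761, 10_149, 6_422)
print([len(chunk_text("x" * n)) for n in sizes])  # [6, 3, 8, 1, 1]
```

Without the pop, the 10,149-char page would yield a second chunk of 149 chars, which is the degenerate case described above.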

Progress log

Video transcripts (12 total — all complete)

#	File	Raw chars	Summary chars	Ratio	Chunks
1	112_video_how-ai-investments-influence-business-process-performance	56,714	16,947	30%	6
2	211_video_how-ai-investments-influence-business-process-performance	56,714	18,112	32%	6
3	211_video_the-importance-of-ai-readiness	3,908	1,012	26%	1
4	211_video_zoomo-scenario-introduction	6,422	2,220	35%	1
5	221_video_why-an-ai-readiness-assessment-part-1	6,613	3,654	55%	1
6	221_video_why-an-ai-readiness-assessment-part-2	6,393	1,845	29%	1
7	311_video_crafting-an-ai-business-case	5,639	1,181	21%	1
8	324_video_aligning-ai-with-strategy	5,845	2,594	44%	1
9	411_video_what-success-looks-like	5,294	1,943	37%	1
10	421_video_bacardi-interview-parts-1-5	26,161	7,243	28%	3
11	427_video_types-of-risks-in-ai-projects	3,413	2,019	59%	1
12	429_video_risk-mitigation-strategies	2,840	1,299	46%	1

Course pages (94 total — all complete)

All 94 pages summarised in module order: 0.0 → 1.0 → 2.0 → 3.0 → 4.0 → 5.0 → 6.0 → misc. Pages with linked transcript sidecars had them inlined before chunking, so page summaries cover both prose content and video material in a single summary file.

Total output: 106 summary files (12 transcript + 94 page)

Expected size ratios (from benchmark + observed results)

Source type	Typical raw (chars)	Typical summary (chars)	Ratio
Large transcript (>50KB)	57,000	16,000–19,000	28–33%
Medium transcript (20–30KB)	26,000	8,000–10,000	30–35%
Small transcript (<7KB)	2,000–7,000	800–2,500	30–40%
Small page (no transcript, <5KB)	1,000–5,000	400–1,500	25–40%
Large page + inlined transcripts	10,000–80,000	3,000–20,000	25–35%

Quality checks per summary

Starts with # Summary: TITLE
Contains substantive content — not just headings
No mid-sentence truncation
Compression ratio within expected range
No hallucinated facts or invented structure
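The first four checks lend themselves to a rough automated pass; the hallucination check still needs a human read against the source. The sketch below is illustrative only: check_summary is a hypothetical name, the default ratio bounds are an assumption chosen loosely to span the ratios in the tables above, and the truncation test is a heuristic that will flag summaries ending in an unpunctuated bullet.

```python
def check_summary(summary: str, raw_chars: int,
                  ratio_range: tuple[float, float] = (0.20, 0.60)) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    lines = [ln.strip() for ln in summary.splitlines() if ln.strip()]

    if not lines or not lines[0].startswith("# Summary: "):
        problems.append("does not start with '# Summary: TITLE'")

    # Body text is anything that is neither a heading nor a chunk separator.
    body = [ln for ln in lines if not ln.startswith("#") and ln != "---"]
    if not body:
        problems.append("headings only, no substantive content")
    elif body[-1][-1] not in ".!?)\"'":
        problems.append("possible mid-sentence truncation")

    ratio = len(summary) / raw_chars if raw_chars else 0.0
    low, high = ratio_range
    if not low <= ratio <= high:
        problems.append(f"compression ratio {ratio:.0%} outside {low:.0%}-{high:.0%}")

    return problems
```

Returning a list of problems rather than a single boolean makes it easy to log every failed check per file during a full run.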

Final artefacts

After all 94 pages + 12 videos are done:

106 _summary.md files committed beside their sources
Convention documented in CANVAS-CRAWL-RUNBOOK-07-summarise.md
Script at scripts/summarize_201.py (also pushed to gist for reuse)
This writeup posted as a public gist

Updated in real time as each item completes.

