An LLM in code is just a function call over HTTP. There's no magic "spinning up" — you either hit a local daemon or a cloud endpoint, send {model, messages, options}, and get text back. For this spike I'd use Ollama (local): it's the on-prem, llama.cpp-family runtime we identified as the production sharp tool, so what you learn transfers directly; no API key; and it has a JSON mode. If ollama serve is running, you POST http://localhost:11434/api/chat with {"model": "llama3.1" (or any instruct model you've pulled), "messages": [...], "format": "json", "stream": false, "options": {"temperature": 0}}, and the JSON you want is in response ["message"]["content"] (as a string you still have to json.loads). urllib from the stdlib is enough — no new deps.
Three concepts that are the whole skill, each a deliberate choice:
- The system prompt is the component's contract, not chit-chat. You're not conversing; you're defining a function in English. It must state the job ("identify spans likely