Extract structured Markdown-formatted conversations from exported LLM chat HTML files (e.g., ChatGPT). Outputs a clean JSON
file with user/assistant roles, turns, and content.
- Parses exported chat `.html` files
- Detects roles via visually hidden headings (`You said:`, `ChatGPT said:`)
- Extracts Markdown content per turn
- Outputs structured JSON with metadata and role-based turns
- Logs skipped/empty entries to stderr
Requirements:
- Python 3.8+
- markdownify
- beautifulsoup4
Install dependencies:
pip install markdownify beautifulsoup4
Usage:
python extract_llm_conversation.py path/to/chat.html
This writes `path/to/chat_conversation.json` with the format:
{
  "metadata": {
    "converted_at": "...",
    "source_file": "chat.html",
    "format": "llm_conversation_markdown_v2",
    "turns": 66
  },
  "conversation": [
    {
      "role": "user",
      "turn": 1,
      "content": "Hello!"
    },
    {
      "role": "assistant",
      "turn": 2,
      "content": "Hi there! How can I help?"
    }
    ...
  ]
}
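As an illustration of the format above, a document with this shape could be assembled and written using only the standard library. The helper names (`output_path_for`, `build_document`, `write_document`) are hypothetical, and the ISO 8601 UTC timestamp is an assumption, since the example shows `converted_at` only as `"..."`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List


def output_path_for(source: Path) -> Path:
    """Map path/to/chat.html to path/to/chat_conversation.json."""
    return source.with_name(source.stem + "_conversation.json")


def build_document(turns: List[Dict[str, object]], source: Path) -> dict:
    """Wrap the extracted turns in the metadata envelope shown above."""
    return {
        "metadata": {
            # Assumption: the real timestamp format is not documented.
            "converted_at": datetime.now(timezone.utc).isoformat(),
            "source_file": source.name,
            "format": "llm_conversation_markdown_v2",
            "turns": len(turns),
        },
        "conversation": turns,
    }


def write_document(turns: List[Dict[str, object]], source: Path) -> Path:
    """Serialize the document next to the source file and return its path."""
    target = output_path_for(source)
    payload = json.dumps(build_document(turns, source), indent=2, ensure_ascii=False)
    target.write_text(payload, encoding="utf-8")
    return target
```

Passing `ensure_ascii=False` keeps non-ASCII characters in the Markdown content readable in the output file rather than escaping them as `\uXXXX` sequences.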
CC-BY © Christian Prior-Mamulyan