@fredriccliver
Created November 24, 2024 18:11
Web Content Extraction for LLM Context Augmentation: A Comparative Analysis
| Feature | Mozilla Readability | Article-extractor | Node-unfluff | HTML-to-text |
| --- | --- | --- | --- | --- |
| Content Quality | Excellent - clean and complete | Very Good - with HTML markup | Good - some formatting issues | Poor - noisy |
| Structure Preservation | ✅ Excellent hierarchy | ✅ Good with HTML tags | ⚠️ Partial | ❌ Poor |
| Formatting | ✅ Clean paragraphs and headers | ✅ HTML formatted | ⚠️ Inconsistent | ❌ Many line breaks |
| Metadata | Title/excerpt | Title/author/date | Title/description/author/date | None |
| Navigation/UI Removal | ✅ Complete removal | ✅ Good removal | ⚠️ Some remnants | ❌ Contains UI elements |
| Signal-to-noise Ratio | High | High | Medium | Low |
| Text Cleanliness | Very clean | Clean with HTML | Moderately clean | Very noisy |
| Link Handling | ✅ Preserved appropriately | ✅ HTML links | ⚠️ Raw URLs | ❌ Raw URLs scattered |
| Output Format | Plain text with structure | HTML | Plain text | Plain text |
| Whitespace Handling | ✅ Consistent | ✅ HTML-managed | ⚠️ Inconsistent | ❌ Excessive breaks |
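
The whitespace rows above suggest that output from the noisier extractors may need cleanup before it is placed in an LLM prompt. The following is a minimal sketch of one such post-processing step. `normalizeWhitespace` is a hypothetical helper written for illustration; it is not part of any of the libraries compared here, and the rules it applies (at most one blank line between paragraphs, single spaces within a line) are assumptions rather than something the comparison prescribes.

```typescript
// Hypothetical helper (an assumption for illustration, not a library API):
// collapses the excessive breaks and inconsistent spacing that plain-text
// extractors can leave behind.
function normalizeWhitespace(text: string): string {
  return text
    .replace(/\r\n?/g, "\n")     // unify Windows and old Mac line endings
    .replace(/[ \t]+/g, " ")     // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, "\n")    // drop spaces hugging a line break
    .replace(/\n{3,}/g, "\n\n")  // keep at most one blank line between paragraphs
    .trim();                     // strip leading and trailing whitespace
}

const raw = "Title\r\n\r\n\r\n\r\nBody   text \n next";
console.log(JSON.stringify(normalizeWhitespace(raw)));
// prints "Title\n\nBody text\nnext"
```

A step like this only addresses whitespace; it does not remove the navigation or UI remnants noted in the table, which would still need an extractor that handles them.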