fredriccliver · November 24, 2024 18:11
diff --git a/content-extraction-comparison.csv b/content-extraction-comparison.csv
Feature	Mozilla Readability	Article-extractor	Node-unfluff	HTML-to-text
Content Quality	Excellent - clean and complete	Very Good - with HTML markup	Good - some formatting issues	Poor - noisy
Structure Preservation	✅ Excellent hierarchy	✅ Good with HTML tags	⚠️ Partial	❌ Poor
Formatting	✅ Clean paragraphs and headers	✅ HTML formatted	⚠️ Inconsistent	❌ Many line breaks
Metadata	Title/excerpt	Title/author/date	Title/description/author/date	None
Navigation/UI Removal	✅ Complete removal	✅ Good removal	⚠️ Some remnants	❌ Contains UI elements
Signal-to-noise Ratio	High	High	Medium	Low
Text Cleanliness	Very clean	Clean with HTML	Moderately clean	Very noisy
Link Handling	✅ Preserved appropriately	✅ HTML links	⚠️ Raw URLs	❌ Raw URLs scattered
Output Format	Plain text with structure	HTML	Plain text	Plain text
Whitespace Handling	✅ Consistent	✅ HTML-managed	⚠️ Inconsistent	❌ Excessive breaks
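
Although the file carries a `.csv` extension, the columns as shown in this diff appear to be tab-separated, so a default comma-based reader would treat each row as a single field. Below is a minimal sketch of loading the table with Python's standard `csv` module, assuming a tab delimiter. The `load_comparison` helper and the inlined `SAMPLE` excerpt are illustrative names, not part of the PR.

```python
import csv
import io

# Three lines copied from the table above (header plus two rows),
# inlined so the sketch runs on its own.
SAMPLE = (
    "Feature\tMozilla Readability\tArticle-extractor\tNode-unfluff\tHTML-to-text\n"
    "Signal-to-noise Ratio\tHigh\tHigh\tMedium\tLow\n"
    "Output Format\tPlain text with structure\tHTML\tPlain text\tPlain text\n"
)


def load_comparison(stream):
    """Return {feature: {library: rating}} from a tab-separated table.

    Hypothetical helper: assumes the first column is named "Feature"
    and every other column is a library name.
    """
    reader = csv.DictReader(stream, delimiter="\t")
    table = {}
    for row in reader:
        feature = row.pop("Feature")
        table[feature] = dict(row)
    return table


table = load_comparison(io.StringIO(SAMPLE))
print(table["Signal-to-noise Ratio"]["Node-unfluff"])  # prints "Medium"
```

To read the committed file instead of the excerpt, the same helper could be called with `open("content-extraction-comparison.csv", newline="", encoding="utf-8")`, again assuming the file on disk is tab-delimited as it appears here.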