Wikimedia Enterprise API v2
- Authentication: https://auth.enterprise.wikimedia.com/v1/login
- Snapshots API: https://api.enterprise.wikimedia.com/v2/snapshots
- Download endpoint: https://api.enterprise.wikimedia.com/v2/snapshots/{wiki_identifier}/chunks/{chunk_id}/download
Only namespace 0 (main articles) is downloaded — excluding talk pages, user pages, etc.
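A minimal sketch of talking to these endpoints with `requests`; the login payload, the `access_token` response field, and the namespace field layout are assumptions based on the endpoints listed above rather than guaranteed details:

```python
import requests

AUTH_URL = "https://auth.enterprise.wikimedia.com/v1/login"
API_BASE = "https://api.enterprise.wikimedia.com/v2"

def get_token(username: str, password: str) -> str:
    """Exchange credentials for a bearer token (response field name assumed)."""
    resp = requests.post(AUTH_URL, json={"username": username, "password": password})
    resp.raise_for_status()
    return resp.json()["access_token"]

def download_chunk(token: str, wiki: str, chunk: str, dest: str) -> None:
    """Stream one snapshot chunk to disk."""
    url = f"{API_BASE}/snapshots/{wiki}/chunks/{chunk}/download"
    with requests.get(url, headers={"Authorization": f"Bearer {token}"}, stream=True) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fh:
            for block in resp.iter_content(chunk_size=1 << 20):
                fh.write(block)

def is_main_namespace(record: dict) -> bool:
    """Keep only namespace 0 (main articles); field layout assumed."""
    return record.get("namespace", {}).get("identifier") == 0
```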
The pipeline has three main stages: record-level filtering, text and infobox extraction, and post-extraction quality filtering.
Filters applied to raw records:
| Filter | Why |
|---|---|
| Redirect pages | Detected via #redirect in wikitext or mw:PageProp/redirect in HTML. These are just pointers to other pages, not real content. |
| Disambiguation pages | Identified using Wikidata IDs (queried from wd:Q4167410 and wd:Q15623926). These are index pages listing multiple meanings, not actual articles. |
| Empty HTML | Some records arrive without an HTML body; these are dropped. |
| Reference/Citation sections | Sections like "See also", "References", "External links" are excluded. These are link-heavy but content-light. Compiled per-language (e.g., Arabic, Chinese, Kurdish equivalents). |
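A minimal sketch of these record-level checks, assuming Wikimedia Enterprise records expose `article_body.wikitext`, `article_body.html`, and `main_entity.identifier`, and that the disambiguation Wikidata IDs have already been gathered into a set; field and function names are illustrative. The section-level exclusion in the last row happens during extraction and is sketched further below.

```python
import re

# Wikidata IDs of disambiguation classes (wd:Q4167410, wd:Q15623926) are used to
# build this set up front; populating it is out of scope for this sketch.
DISAMBIGUATION_QIDS: set[str] = set()

REDIRECT_WIKITEXT = re.compile(r"#redirect", re.IGNORECASE)
REDIRECT_HTML_MARKER = "mw:PageProp/redirect"

def keep_record(record: dict) -> bool:
    """Return True if a raw dump record survives the first filtering stage."""
    body = record.get("article_body", {})
    wikitext = body.get("wikitext") or ""
    html = body.get("html") or ""

    # Drop records that arrive without an HTML body.
    if not html.strip():
        return False
    # Drop redirects, detected in either representation.
    if REDIRECT_WIKITEXT.search(wikitext) or REDIRECT_HTML_MARKER in html:
        return False
    # Drop disambiguation pages via their linked Wikidata item.
    if record.get("main_entity", {}).get("identifier") in DISAMBIGUATION_QIDS:
        return False
    return True
```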
What it extracts:
- Clean plaintext with Markdown-style headers (# Title, ## Section)
- Infoboxes as structured JSON (key-value pairs from sidebars)
- Flags articles containing math formulas
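A rough sketch of that extraction step using BeautifulSoup; the excluded section titles, the `table.infobox` selector, and the `mwe-math-element` math marker are assumptions about how such a step could be written, not the pipeline's exact rules:

```python
import json
from bs4 import BeautifulSoup

# Section titles dropped during extraction (English shown; the pipeline compiles
# per-language equivalents, e.g. for Arabic, Chinese, Kurdish).
EXCLUDED_SECTIONS = {"See also", "References", "External links", "Notes", "Bibliography"}

def extract(title: str, html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    # Flag math before pruning anything; rendered formulas sit in elements
    # carrying the "mwe-math-element" class.
    has_math = soup.select_one(".mwe-math-element") is not None

    # Pull infoboxes out as key-value pairs, then remove them from the tree
    # so they don't pollute the plaintext.
    infoboxes = []
    for box in soup.select("table.infobox"):
        data = {}
        for row in box.select("tr"):
            cells = row.find_all(["th", "td"])
            if len(cells) == 2:
                data[cells[0].get_text(" ", strip=True)] = cells[1].get_text(" ", strip=True)
        caption = box.find("caption")
        if data:
            infoboxes.append({"title": caption.get_text(strip=True) if caption else "", "data": data})
        box.decompose()

    # Build plaintext with Markdown-style headers, skipping excluded sections.
    lines, skipping = [f"# {title}"], False
    for node in soup.find_all(["h2", "h3", "p"]):
        if node.name in ("h2", "h3"):
            heading = node.get_text(" ", strip=True)
            skipping = heading in EXCLUDED_SECTIONS
            if not skipping:
                lines.append("#" * int(node.name[1]) + " " + heading)
        elif not skipping:
            para = node.get_text(" ", strip=True)
            if para:
                lines.append(para)

    return {"text": "\n\n".join(lines), "infoboxes": json.dumps(infoboxes), "has_math": has_math}
```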
Filters applied after extraction:
| Filter | Why |
|---|---|
| Very short articles | Fewer than 20 characters, or fewer than 2 lines with no infobox; these are likely stubs or empty pages. |
| Wrong script | Uses GlotLID to detect script. E.g., if Arabic Wikipedia has content in Latin script, it's likely spam or misplaced content. |
| English leak in non-English wikis | For wikis like French or Spanish, filters out articles that are >70% English (probably untranslated imports). |
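A sketch of these post-extraction checks, assuming GlotLID is loaded as the fastText model published at `cis-lmu/glotlid` on HuggingFace; the thresholds, model file name, and helper names are illustrative:

```python
import fasttext
from huggingface_hub import hf_hub_download

# GlotLID ships as a fastText model; predictions look like "__label__arb_Arab".
_MODEL = fasttext.load_model(hf_hub_download("cis-lmu/glotlid", "model.bin"))

def predict_lang_script(text: str) -> str:
    labels, _ = _MODEL.predict(text.replace("\n", " "), k=1)
    return labels[0].removeprefix("__label__")  # e.g. "eng_Latn"

def english_fraction(text: str) -> float:
    """Share of non-empty lines identified as English."""
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return 0.0
    return sum(predict_lang_script(l).startswith("eng_") for l in lines) / len(lines)

def keep_article(text: str, has_infobox: bool, wiki_lang: str, expected_script: str) -> bool:
    # Very short articles: likely stubs or empty pages.
    if len(text) < 20 or (len(text.splitlines()) < 2 and not has_infobox):
        return False
    # Wrong script for this wiki (e.g. Latin-script text on Arabic Wikipedia).
    if not predict_lang_script(text).endswith("_" + expected_script):
        return False
    # English leak: for non-English wikis, drop articles that are mostly English.
    if wiki_lang != "en" and english_fraction(text) > 0.7:
        return False
    return True
```

GlotLID labels pair an ISO 639-3 language code with a script suffix (e.g. `fra_Latn`), which is why a single model can drive both the wrong-script and English-leak checks.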
Each article is saved with these fields:
{
  "url": "https://en.wikipedia.org/wiki/Example",
  "title": "Example",
  "text": "# Example\n\nAn example is a representative...",
  "id": "enwiki/12345",
  "language_code": "en",
  "wikidata_id": "Q123456",
  "bytes_html": 45678,
  "wikitext": "{{Infobox...}}...",
  "version": 987654321,
  "infoboxes": "[{\"title\": \"Person\", \"data\": {\"Born\": \"1990\", \"Occupation\": \"Engineer\"}}]",
  "has_math": false
}

Key fields explained:
| Field | Description |
|---|---|
| `text` | Clean, readable text with section headers. Tables converted to Markdown. |
| `infoboxes` | Structured data from sidebar boxes (birth dates, locations, statistics, etc.) |
| `wikidata_id` | Links to Wikidata for entity resolution |
| `has_math` | Useful for filtering scientific/mathematical content |
| `wikitext` | Original source (preserved for reference) |
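Note that `infoboxes` is stored as a JSON-encoded string (see the sample record above), so consumers need one extra decoding step; a minimal illustration:

```python
import json

record = {
    "infoboxes": "[{\"title\": \"Person\", \"data\": {\"Born\": \"1990\", \"Occupation\": \"Engineer\"}}]",
}

# The field holds serialized JSON rather than a nested object; decode it first.
for box in json.loads(record["infoboxes"]):
    print(box["title"], box["data"])  # Person {'Born': '1990', 'Occupation': 'Engineer'}
```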
What this does: Takes Wikipedia's raw HTML dumps and converts them into clean, machine-readable text suitable for training AI models.
Why the cleaning?
- Redirects & disambiguations → These aren't real articles, just navigation aids
- Reference sections → Mostly links, citations, and boilerplate — adds noise, not knowledge
- Script/language filtering → Ensures French Wikipedia content is actually French, not English text that slipped through
- Infobox extraction → Those sidebar boxes ("Born: 1990, Occupation: Engineer") contain highly structured facts — great for knowledge graphs
The end result: Clean Wikipedia text + structured infobox data, ready for AI training or knowledge extraction. Currently uploaded to HuggingFace as josancamon/finewiki.
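The published dataset should be loadable with the `datasets` library; the split name and streaming flag below are assumptions about how the upload is organized:

```python
from datasets import load_dataset

# Stream instead of downloading every shard up front; split name is assumed.
ds = load_dataset("josancamon/finewiki", split="train", streaming=True)

for article in ds:
    print(article["title"], article["language_code"], article["has_math"])
    break
```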
- English Wikipedia alone: ~144 GB of raw HTML (split into 100+ chunks)
- Processing uses parallel workers with timeout handling for problematic pages
- Outputs stored in Parquet format in 500 MB shards
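A sketch of how that processing loop could be organized, with a worker pool, a per-page timeout, and shard rotation around the 500 MB mark; the pool size, timeout value, size heuristic, and `process_page` placeholder are illustrative assumptions:

```python
from concurrent.futures import ProcessPoolExecutor, TimeoutError as FuturesTimeout

import pyarrow as pa
import pyarrow.parquet as pq

SHARD_BYTES = 500 * 1024 * 1024  # rotate output shards around 500 MB
PAGE_TIMEOUT_S = 60              # give up on pages that hang the extractor

def process_page(raw_record: dict) -> dict | None:
    """Placeholder for the filter + extraction steps described above."""
    ...

def run(raw_records: list[dict], out_prefix: str, workers: int = 16) -> None:
    shard, shard_idx, shard_bytes = [], 0, 0
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_page, rec) for rec in raw_records]
        for fut in futures:
            try:
                row = fut.result(timeout=PAGE_TIMEOUT_S)
            except FuturesTimeout:
                continue  # problematic page: skip it rather than stall the run
            if row is None:
                continue  # filtered out
            shard.append(row)
            # Rough size estimate: text and wikitext dominate the record.
            shard_bytes += len(row.get("text", "")) + len(row.get("wikitext", ""))
            if shard_bytes >= SHARD_BYTES:
                pq.write_table(pa.Table.from_pylist(shard), f"{out_prefix}-{shard_idx:05d}.parquet")
                shard, shard_idx, shard_bytes = [], shard_idx + 1, 0
    if shard:
        pq.write_table(pa.Table.from_pylist(shard), f"{out_prefix}-{shard_idx:05d}.parquet")
```

Note that `Future.result(timeout=...)` only stops waiting for a stuck page; the underlying worker process is not killed, so a real run would also need to recycle or terminate hung workers.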