wiki-kg stanford pipeline explained

1. API Endpoint Used

Wikimedia Enterprise API v2

  • Authentication: https://auth.enterprise.wikimedia.com/v1/login
  • Snapshots API: https://api.enterprise.wikimedia.com/v2/snapshots
  • Download endpoint: https://api.enterprise.wikimedia.com/v2/snapshots/{wiki_identifier}/chunks/{chunk_id}/download

Only namespace 0 (main articles) is downloaded — excluding talk pages, user pages, etc.
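
To illustrate how these endpoints fit together, here is a minimal sketch in Python that authenticates, lists the available snapshots, and streams one chunk to disk. It is not the pipeline's actual code: the environment variable names, the login/response field names (username, password, access_token), and the example wiki_identifier / chunk_id values and file extension are assumptions.

import os
import requests

AUTH_URL = "https://auth.enterprise.wikimedia.com/v1/login"
SNAPSHOTS_URL = "https://api.enterprise.wikimedia.com/v2/snapshots"

# Log in and grab a bearer token (request/response field names assumed)
resp = requests.post(AUTH_URL, json={
    "username": os.environ["WME_USERNAME"],
    "password": os.environ["WME_PASSWORD"],
})
resp.raise_for_status()
headers = {"Authorization": f"Bearer {resp.json()['access_token']}"}

# List available snapshots (the pipeline only uses the namespace-0 ones)
snapshots = requests.get(SNAPSHOTS_URL, headers=headers).json()

# Stream a single chunk of one snapshot to disk
wiki_identifier = "enwiki_namespace_0"   # placeholder identifier
chunk_id = "enwiki_namespace_0_0"        # placeholder chunk id
download_url = f"{SNAPSHOTS_URL}/{wiki_identifier}/chunks/{chunk_id}/download"
with requests.get(download_url, headers=headers, stream=True) as r:
    r.raise_for_status()
    with open(f"{chunk_id}.ndjson.gz", "wb") as f:  # file extension assumed
        for block in r.iter_content(chunk_size=1 << 20):
            f.write(block)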


2. Data Cleaning Process

The pipeline has three main stages: the two cleaning stages described below, plus the output formatting covered in section 3.

Stage 1: Extraction (HTML → Clean Text)

Filters applied and why:

  • Redirect pages: detected via #redirect in the wikitext or mw:PageProp/redirect in the HTML. These are just pointers to other pages, not real content.
  • Disambiguation pages: identified using Wikidata IDs (queried from wd:Q4167410 and wd:Q15623926). These are index pages listing multiple meanings, not actual articles.
  • Empty HTML: some records arrive without an HTML body; they are dropped.
  • Reference/citation sections: sections like "See also", "References", and "External links" are excluded. These are link-heavy but content-light. The section names are compiled per language (e.g., Arabic, Chinese, Kurdish equivalents).
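
A minimal sketch of how the redirect, disambiguation, and empty-HTML checks could be expressed, assuming each record carries wikitext, article_body_html, and wikidata_id fields (the field and function names here are illustrative, not the pipeline's):

import re

# Parsoid HTML marks redirect pages with this page property
REDIRECT_HTML_MARKER = "mw:PageProp/redirect"
REDIRECT_WIKITEXT_RE = re.compile(r"^\s*#redirect", re.IGNORECASE)

def should_skip(record: dict, disambiguation_qids: set) -> bool:
    """disambiguation_qids: QIDs of items Wikidata lists as instances of
    wd:Q4167410 or wd:Q15623926 (i.e. known disambiguation pages)."""
    html = record.get("article_body_html") or ""
    wikitext = record.get("wikitext") or ""
    if not html.strip():
        return True   # empty HTML body: drop
    if REDIRECT_WIKITEXT_RE.match(wikitext) or REDIRECT_HTML_MARKER in html:
        return True   # redirect page: just a pointer, not content
    if record.get("wikidata_id") in disambiguation_qids:
        return True   # disambiguation page: an index of meanings, not an article
    return False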

What it extracts:

  • Clean plaintext with Markdown-style headers (# Title, ## Section)
  • Infoboxes as structured JSON (key-value pairs from sidebars)
  • Flags articles containing math formulas
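
As a toy illustration of the extraction output (not the extractor itself), assume the HTML has already been parsed into a title, section texts, and infobox key-value pairs; the <math>-tag check for has_math is an assumption about how that flag might be derived:

import json

def assemble_record(title, sections, infoboxes, html):
    """sections: list of (heading, body) pairs; infoboxes: list of dicts."""
    parts = [f"# {title}"]                      # Markdown-style page title
    for heading, body in sections:
        if heading:
            parts.append(f"## {heading}")       # Markdown-style section header
        parts.append(body)
    return {
        "title": title,
        "text": "\n\n".join(parts),
        # Stored as a JSON-encoded string, matching the output format in section 3
        "infoboxes": json.dumps(infoboxes, ensure_ascii=False),
        # Assumption: the math flag comes from <math> markup in the source HTML
        "has_math": "<math" in html,
    }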

Stage 2: Language/Script Filtering

Filters applied and why:

  • Very short articles: fewer than 20 characters, or fewer than 2 lines with no infobox; these are likely stubs or empty pages.
  • Wrong script: uses GlotLID to detect the script. For example, if Arabic Wikipedia has content in Latin script, it is likely spam or misplaced content.
  • English leak in non-English wikis: for wikis like French or Spanish, articles that are more than 70% English are filtered out (probably untranslated imports).
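
A sketch of what these checks could look like, assuming GlotLID is used via its fastText checkpoint on the Hugging Face Hub (repo id cis-lmu/glotlid) and that its labels have the usual __label__<lang>_<Script> form; the per-wiki expected script and the exact unit over which the 70% English share is measured are assumptions:

import fasttext
from huggingface_hub import hf_hub_download

# GlotLID ships as a fastText model; repo id and filename assumed
lid = fasttext.load_model(hf_hub_download("cis-lmu/glotlid", "model.bin"))

def detect_lang_script(text):
    """Return (language, script), e.g. ("eng", "Latn")."""
    labels, _ = lid.predict(text.replace("\n", " "), k=1)
    lang, script = labels[0].removeprefix("__label__").rsplit("_", 1)
    return lang, script

def keep_article(text, has_infobox, expected_script, is_english_wiki):
    # Very short articles: <20 characters, or <2 lines with no infobox
    if len(text) < 20 or (text.count("\n") < 1 and not has_infobox):
        return False
    # Wrong script, e.g. Latin-script text on Arabic Wikipedia
    if detect_lang_script(text)[1] != expected_script:
        return False
    # English leak: on non-English wikis, drop articles that are >70% English
    if not is_english_wiki:
        lines = [l for l in text.split("\n") if l.strip()]
        english = sum(1 for l in lines if detect_lang_script(l)[0] == "eng")
        if lines and english / len(lines) > 0.7:
            return False
    return True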

3. Output Data Format

Each article is saved with these fields:

{
  "url": "https://en.wikipedia.org/wiki/Example",
  "title": "Example",
  "text": "# Example\n\nAn example is a representative...",
  "id": "enwiki/12345",
  "language_code": "en",
  "wikidata_id": "Q123456",
  "bytes_html": 45678,
  "wikitext": "{{Infobox...}}...",
  "version": 987654321,
  "infoboxes": "[{\"title\": \"Person\", \"data\": {\"Born\": \"1990\", \"Occupation\": \"Engineer\"}}]",
  "has_math": false
}

Key fields explained:

  • text: clean, readable text with section headers; tables are converted to Markdown.
  • infoboxes: structured data from the sidebar boxes (birth dates, locations, statistics, etc.).
  • wikidata_id: links to Wikidata for entity resolution.
  • has_math: useful for filtering scientific/mathematical content.
  • wikitext: the original source, preserved for reference.
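
Note that infoboxes is a JSON-encoded string rather than a nested object, so consumers have to decode it. Using the example record above:

import json

infoboxes_field = '[{"title": "Person", "data": {"Born": "1990", "Occupation": "Engineer"}}]'

# Decode the string once, then walk the structured key-value pairs
for box in json.loads(infoboxes_field):
    print(box["title"])
    for key, value in box["data"].items():
        print(f"  {key}: {value}")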

4. Summary

What this does: Takes Wikipedia's raw HTML dumps and converts them into clean, machine-readable text suitable for training AI models.

Why the cleaning?

  • Redirects & disambiguations → These aren't real articles, just navigation aids
  • Reference sections → Mostly links, citations, and boilerplate — adds noise, not knowledge
  • Script/language filtering → Ensures French Wikipedia content is actually French, not English text that slipped through
  • Infobox extraction → Those sidebar boxes ("Born: 1990, Occupation: Engineer") contain highly structured facts — great for knowledge graphs

The end result: Clean Wikipedia text + structured infobox data, ready for AI training or knowledge extraction. Currently uploaded to HuggingFace as josancamon/finewiki.
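
If you just want the output, something like the following should work with the datasets library; the subset and split names are assumptions, so check the dataset card on the Hugging Face Hub:

from datasets import load_dataset

# Subset and split names assumed; adjust to match the dataset card
finewiki = load_dataset("josancamon/finewiki", "en", split="train", streaming=True)

for article in finewiki:
    print(article["title"], article["wikidata_id"], article["has_math"])
    break  # just peek at the first record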


5. Scale

  • English Wikipedia alone: ~144 GB of raw HTML (split into 100+ chunks)
  • Processing uses parallel workers with timeout handling for problematic pages
  • Outputs stored in parquet format with 500MB shards
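
A rough sketch of the parallel-processing pattern described above (the worker count, timeout value, and function names are illustrative, not the pipeline's):

from concurrent.futures import ProcessPoolExecutor, TimeoutError

def process_page(record):
    ...  # stage 1 + stage 2: extract clean text, infoboxes, flags

def process_chunk(records, workers=16, timeout_s=60):
    """Extract pages in parallel, skipping any page that hangs past the timeout."""
    results = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_page, rec): rec for rec in records}
        for future in futures:
            try:
                results.append(future.result(timeout=timeout_s))
            except TimeoutError:
                continue  # problematic page: drop it and move on
    return results

# The collected records are then written out as ~500 MB Parquet shards.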