wiki-kg stanford pipeline explained

1. API Endpoint Used

Wikimedia Enterprise API v2

  • Authentication: https://auth.enterprise.wikimedia.com/v1/login
  • Snapshots API: https://api.enterprise.wikimedia.com/v2/snapshots
  • Download endpoint: https://api.enterprise.wikimedia.com/v2/snapshots/{wiki_identifier}/chunks/{chunk_id}/download

Only namespace 0 (main articles) is downloaded — excluding talk pages, user pages, etc.
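
To illustrate how these endpoints fit together, here is a minimal sketch in Python that authenticates, lists the available snapshots, and streams one chunk to disk. It is not the pipeline's actual code: the environment variable names, the login/response field names (username, password, access_token), and the example wiki_identifier / chunk_id values and file extension are assumptions.

import os
import requests

AUTH_URL = "https://auth.enterprise.wikimedia.com/v1/login"
SNAPSHOTS_URL = "https://api.enterprise.wikimedia.com/v2/snapshots"

# Log in and grab a bearer token (request/response field names assumed)
resp = requests.post(AUTH_URL, json={
    "username": os.environ["WME_USERNAME"],
    "password": os.environ["WME_PASSWORD"],
})
resp.raise_for_status()
headers = {"Authorization": f"Bearer {resp.json()['access_token']}"}

# List available snapshots (the pipeline only uses the namespace-0 ones)
snapshots = requests.get(SNAPSHOTS_URL, headers=headers).json()

# Stream a single chunk of one snapshot to disk
wiki_identifier = "enwiki_namespace_0"   # placeholder identifier
chunk_id = "enwiki_namespace_0_0"        # placeholder chunk id
download_url = f"{SNAPSHOTS_URL}/{wiki_identifier}/chunks/{chunk_id}/download"
with requests.get(download_url, headers=headers, stream=True) as r:
    r.raise_for_status()
    with open(f"{chunk_id}.ndjson.gz", "wb") as f:  # file extension assumed
        for block in r.iter_content(chunk_size=1 << 20):
            f.write(block)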


2. Data Cleaning Process

The pipeline has three main stages: the two cleaning stages described below, plus the output formatting covered in section 3.

Stage 1: Extraction (HTML → Clean Text)

Filters applied and why:

  • Redirect pages: detected via #redirect in the wikitext or mw:PageProp/redirect in the HTML. These are just pointers to other pages, not real content.
  • Disambiguation pages: identified using Wikidata IDs (queried from wd:Q4167410 and wd:Q15623926). These are index pages listing multiple meanings, not actual articles.
  • Empty HTML: some records arrive without an HTML body; they are dropped.
  • Reference/citation sections: sections like "See also", "References", and "External links" are excluded. These are link-heavy but content-light. The section names are compiled per language (e.g., Arabic, Chinese, Kurdish equivalents).
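
A minimal sketch of how the redirect, disambiguation, and empty-HTML checks could be expressed, assuming each record carries wikitext, article_body_html, and wikidata_id fields (the field and function names here are illustrative, not the pipeline's):

import re

# Parsoid HTML marks redirect pages with this page property
REDIRECT_HTML_MARKER = "mw:PageProp/redirect"
REDIRECT_WIKITEXT_RE = re.compile(r"^\s*#redirect", re.IGNORECASE)

def should_skip(record: dict, disambiguation_qids: set) -> bool:
    """disambiguation_qids: QIDs of items Wikidata lists as instances of
    wd:Q4167410 or wd:Q15623926 (i.e. known disambiguation pages)."""
    html = record.get("article_body_html") or ""
    wikitext = record.get("wikitext") or ""
    if not html.strip():
        return True   # empty HTML body: drop
    if REDIRECT_WIKITEXT_RE.match(wikitext) or REDIRECT_HTML_MARKER in html:
        return True   # redirect page: just a pointer, not content
    if record.get("wikidata_id") in disambiguation_qids:
        return True   # disambiguation page: an index of meanings, not an article
    return False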

What it extracts:

  • Clean plaintext with Markdown-style headers (# Title, ## Section)
  • Infoboxes as structured JSON (key-value pairs from sidebars)
  • Flags articles containing math formulas
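
As a toy illustration of the extraction output (not the extractor itself), assume the HTML has already been parsed into a title, section texts, and infobox key-value pairs; the <math>-tag check for has_math is an assumption about how that flag might be derived:

import json

def assemble_record(title, sections, infoboxes, html):
    """sections: list of (heading, body) pairs; infoboxes: list of dicts."""
    parts = [f"# {title}"]                      # Markdown-style page title
    for heading, body in sections:
        if heading:
            parts.append(f"## {heading}")       # Markdown-style section header
        parts.append(body)
    return {
        "title": title,
        "text": "\n\n".join(parts),
        # Stored as a JSON-encoded string, matching the output format in section 3
        "infoboxes": json.dumps(infoboxes, ensure_ascii=False),
        # Assumption: the math flag comes from <math> markup in the source HTML
        "has_math": "<math" in html,
    }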

Stage 2: Language/Script Filtering

Filters applied and why:

  • Very short articles: fewer than 20 characters, or fewer than 2 lines with no infobox; these are likely stubs or empty pages.
  • Wrong script: uses GlotLID to detect the script. For example, if Arabic Wikipedia has content in Latin script, it is likely spam or misplaced content.
  • English leak in non-English wikis: for wikis like French or Spanish, articles that are more than 70% English are filtered out (probably untranslated imports).
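
A sketch of what these checks could look like, assuming GlotLID is used via its fastText checkpoint on the Hugging Face Hub (repo id cis-lmu/glotlid) and that its labels have the usual __label__<lang>_<Script> form; the per-wiki expected script and the exact unit over which the 70% English share is measured are assumptions:

import fasttext
from huggingface_hub import hf_hub_download

# GlotLID ships as a fastText model; repo id and filename assumed
lid = fasttext.load_model(hf_hub_download("cis-lmu/glotlid", "model.bin"))

def detect_lang_script(text):
    """Return (language, script), e.g. ("eng", "Latn")."""
    labels, _ = lid.predict(text.replace("\n", " "), k=1)
    lang, script = labels[0].removeprefix("__label__").rsplit("_", 1)
    return lang, script

def keep_article(text, has_infobox, expected_script, is_english_wiki):
    # Very short articles: <20 characters, or <2 lines with no infobox
    if len(text) < 20 or (text.count("\n") < 1 and not has_infobox):
        return False
    # Wrong script, e.g. Latin-script text on Arabic Wikipedia
    if detect_lang_script(text)[1] != expected_script:
        return False
    # English leak: on non-English wikis, drop articles that are >70% English
    if not is_english_wiki:
        lines = [l for l in text.split("\n") if l.strip()]
        english = sum(1 for l in lines if detect_lang_script(l)[0] == "eng")
        if lines and english / len(lines) > 0.7:
            return False
    return True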

3. Output Data Format

Each article is saved with these fields:

{
  "url": "https://en.wikipedia.org/wiki/Example",
  "title": "Example",
  "text": "# Example\n\nAn example is a representative...",
  "id": "enwiki/12345",
  "language_code": "en",
  "wikidata_id": "Q123456",
  "bytes_html": 45678,
  "wikitext": "{{Infobox...}}...",
  "version": 987654321,
  "infoboxes": "[{\"title\": \"Person\", \"data\": {\"Born\": \"1990\", \"Occupation\": \"Engineer\"}}]",
  "has_math": false
}

Key fields explained:

  • text: clean, readable text with section headers; tables are converted to Markdown.
  • infoboxes: structured data from the sidebar boxes (birth dates, locations, statistics, etc.).
  • wikidata_id: links to Wikidata for entity resolution.
  • has_math: useful for filtering scientific/mathematical content.
  • wikitext: the original source, preserved for reference.
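
Note that infoboxes is a JSON-encoded string rather than a nested object, so consumers have to decode it. Using the example record above:

import json

infoboxes_field = '[{"title": "Person", "data": {"Born": "1990", "Occupation": "Engineer"}}]'

# Decode the string once, then walk the structured key-value pairs
for box in json.loads(infoboxes_field):
    print(box["title"])
    for key, value in box["data"].items():
        print(f"  {key}: {value}")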

4. Summary

What this does: Takes Wikipedia's raw HTML dumps and converts them into clean, machine-readable text suitable for training AI models.

Why the cleaning?

  • Redirects & disambiguations → These aren't real articles, just navigation aids
  • Reference sections → Mostly links, citations, and boilerplate — adds noise, not knowledge
  • Script/language filtering → Ensures French Wikipedia content is actually French, not English text that slipped through
  • Infobox extraction → Those sidebar boxes ("Born: 1990, Occupation: Engineer") contain highly structured facts — great for knowledge graphs

The end result: Clean Wikipedia text + structured infobox data, ready for AI training or knowledge extraction. Currently uploaded to HuggingFace as josancamon/finewiki.
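
If you just want the output, something like the following should work with the datasets library; the subset and split names are assumptions, so check the dataset card on the Hugging Face Hub:

from datasets import load_dataset

# Subset and split names assumed; adjust to match the dataset card
finewiki = load_dataset("josancamon/finewiki", "en", split="train", streaming=True)

for article in finewiki:
    print(article["title"], article["wikidata_id"], article["has_math"])
    break  # just peek at the first record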


5. Scale

  • English Wikipedia alone: ~144 GB of raw HTML (split into 100+ chunks)
  • Processing uses parallel workers with timeout handling for problematic pages
  • Outputs stored in parquet format with 500MB shards
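
A rough sketch of the parallel-processing pattern described above (the worker count, timeout value, and function names are illustrative, not the pipeline's):

from concurrent.futures import ProcessPoolExecutor, TimeoutError

def process_page(record):
    ...  # stage 1 + stage 2: extract clean text, infoboxes, flags

def process_chunk(records, workers=16, timeout_s=60):
    """Extract pages in parallel, skipping any page that hangs past the timeout."""
    results = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_page, rec): rec for rec in records}
        for future in futures:
            try:
                results.append(future.result(timeout=timeout_s))
            except TimeoutError:
                continue  # problematic page: drop it and move on
    return results

# The collected records are then written out as ~500 MB Parquet shards.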