Knowledge Graph Extraction for Embedding Models

A pattern for building temporally-grounded knowledge graphs from scholarly and documentary sources, optimised for downstream knowledge graph embedding (KGE) training.

The core problem

Most knowledge graph construction from text produces triples: (subject, predicate, object). Applied to scholarly or historical sources, this collapses a critical distinction. A monograph arguing that the Treaty of Westphalia established state sovereignty is doing something categorically different from the same monograph reporting that previous historians have made this argument. The first is a knowledge claim being advanced; the second is a claim being attributed. Embedding models trained on undifferentiated triples will represent both as equivalent facts, which is a fundamental error of provenance.

A second problem: triples are atemporal. (Napoleon, commanded, Grande Armée) is true only for a bounded interval. Without a temporal anchor, the triple cannot be placed on a timeline, cannot be compared with contradicting triples from different periods, and cannot serve as useful training data for temporally-aware embedding methods such as TComplEx, TNTComplEx, or DE-SimplE. The date is not metadata about the extraction; it is a first-class component of the knowledge claim itself.

This pattern addresses both problems. Every extracted unit of knowledge is a quad, rather than a triple, and every quad carries a provenance class distinguishing what the source is claiming from what it is reporting others have claimed.

The argument / historiography distinction

Every source operates in two epistemic registers simultaneously. The LLM must maintain strict separation between them throughout extraction.

ARGUMENT register — the source's own claims. Statements the author is asserting as true, conclusions drawn from evidence, interpretations advanced, causal explanations offered. The epistemic agent is the source's author. Provenance: ARG.

HISTORIOGRAPHY register — claims the source attributes to prior scholarship, tradition, or received consensus. Statements introduced with constructions like "historians have argued," "the standard view holds," "Smith (1970) demonstrated," or "it has long been assumed." The epistemic agent is not the author but the cited party. Provenance: HIST.

The distinction matters for three reasons:

First, epistemic weight differs. An ARG quad represents what this source is staking its credibility on. A HIST quad represents what the source takes as established or contested background. Embedding models using these as training data should not treat them identically.

Second, contradiction patterns differ. Two sources may both carry HIST quads attributing the same claim to the same scholar, which makes them concordant. Two sources carrying ARG quads that assert incompatible propositions are in tension. A lint pass that conflates the two will miss real contradictions and flag false ones.

Third, historiography is layered. A source may argue (ARG) that prior historians were wrong in their interpretation (HIST), and may cite a third party to support this critique. Each layer requires its own provenance. The LLM must not flatten this structure.

The temporal element

Every quad takes the following form, four claim components followed by the provenance class the quad carries:

(subject, predicate, object, date, provenance)

date is the time germane to the knowledge claim: the interval or point in time during which the stated relationship held, or to which the claim refers. It is never the date of extraction, and never the publication date of the source unless the publication date is itself the referent of the claim.

Date representation should follow a fixed schema agreed at initialisation. A recommended baseline:

YYYY              exact year
YYYY-MM           year and month
YYYY/YYYY         closed interval (inclusive)
YYYY/             open interval (ongoing from year)
~YYYY             circa (approximate, within ±10 years by default)
~YYYY/YYYY        approximate interval
CENT:NN           century (e.g. CENT:19 = 19th century)
CENT_BCE:NN       century BCE (e.g. CENT_BCE:2 = 2nd century BCE, 200 to 101 BCE)
NULL              date genuinely unknown and uninferable

The LLM must not invent dates. If a date cannot be determined from the source or from secure background knowledge, the value is NULL. A NULL date is not an extraction failure. It is information. NULL-bearing quads remain useful for non-temporal embedding tasks and can be filtered for temporal tasks.
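
The baseline schema above can be checked mechanically. The following is a minimal sketch, assuming one regular expression per row of the table; the pattern list and the name is_valid_date are illustrative and not part of the pattern. It also assumes years before 1000 are zero-padded to four digits, which the schema does not state.

```python
import re

# One pattern per form in the baseline date schema (an assumed reading
# of the table, not a definitive grammar).
DATE_PATTERNS = [
    r"\d{4}",                   # YYYY
    r"\d{4}-(?:0[1-9]|1[0-2])",  # YYYY-MM
    r"~?\d{4}/\d{4}",           # YYYY/YYYY and ~YYYY/YYYY
    r"\d{4}/",                  # YYYY/  (open interval)
    r"~\d{4}",                  # ~YYYY  (circa)
    r"CENT:\d{1,2}",            # CENT:NN
    r"CENT_BCE:\d{1,2}",        # CENT_BCE:NN
    r"NULL",                    # unknown and uninferable
]
DATE_RE = re.compile("|".join(f"(?:{p})" for p in DATE_PATTERNS))
INTERVAL_RE = re.compile(r"~?(\d{4})/(\d{4})")


def is_valid_date(value: str) -> bool:
    """Return True if value matches one of the schema's date forms."""
    if DATE_RE.fullmatch(value) is None:
        return False
    # For closed intervals, additionally require start <= end.
    interval = INTERVAL_RE.fullmatch(value)
    if interval is None:
        return True
    return int(interval.group(1)) <= int(interval.group(2))
```

A check like this only confirms that a value is well-formed. It cannot tell whether the date refers to the claim's referent rather than to the publication or extraction date; that judgement stays with the extractor.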

Date assignment is substantive work, not post-processing. For ARG quads, the date refers to when the argued relationship obtained — not when the author wrote. For HIST quads, the date refers to the time the attributed claim concerns, not when the cited scholar published. The LLM must reason explicitly about this distinction on every quad.

Architecture

Four layers, two of which are persistent.

Raw sources — immutable source documents. Monographs, articles, primary sources, diplomatic records, transcripts, data tables. The LLM reads from these; it does not modify them. Each source receives a stable identifier used in all provenance records.

The quad store — the primary persistent output. A structured collection of extracted quads in a consistent serialisation format (see below). The LLM appends to this store on each extraction pass; it may revise existing quads only through an explicit correction operation with a logged rationale. The store is the training corpus for downstream embedding models and must be treated with corresponding discipline.

The source registry — a persistent catalogue mapping source identifiers to bibliographic metadata: author, title, date of publication, type (monograph / article / primary source / dataset), and a brief characterisation of its argument and methodological stance. The registry is the index for the provenance graph that connects quads to their sources.

The schema — the configuration document governing extraction conventions. Defines the date schema, the predicate vocabulary (see below), any domain-specific entity types, and the protocols for handling edge cases. The schema is co-evolved by the human and LLM over the project's lifetime and committed to version control with the quad store.

The quad format

Each quad is a record with the following fields:

id          Stable unique identifier (e.g. UUID or incrementing integer)
s           Subject entity (normalised string or entity ID)
p           Predicate (controlled vocabulary term)
o           Object entity or literal value (normalised)
date        Temporal anchor (schema above)
prov        Provenance class: ARG or HIST
source      Source identifier from registry
cited_agent For HIST quads: the agent to whom the claim is attributed
confidence  Optional: HIGH / MED / LOW (LLM's assessment of extraction certainty)
note        Optional: free-text rationale for non-obvious extractions

A recommended serialisation is JSONL (one JSON object per line), which is trivially appendable, diff-friendly in version control, and directly ingestible by most KGE preprocessing pipelines. TSV is an acceptable alternative for simpler schemas.
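
As an illustration of the record layout, here is a minimal sketch of a quad as a Python dataclass serialised to one JSONL line. The class name Quad, the helper to_jsonl_line, the identifier formats, and the example values are all assumptions made for the sketch; in particular the date interval is a placeholder, not a checked fact.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Quad:
    id: str
    s: str
    p: str
    o: str
    date: str
    prov: str                          # "ARG" or "HIST"
    source: str
    cited_agent: Optional[str] = None  # expected when prov == "HIST"
    confidence: Optional[str] = None   # "HIGH" / "MED" / "LOW"
    note: Optional[str] = None

    def __post_init__(self) -> None:
        if self.prov not in ("ARG", "HIST"):
            raise ValueError(f"unknown provenance class: {self.prov!r}")
        if self.prov == "HIST" and not self.cited_agent:
            raise ValueError("HIST quads need a cited_agent")


def to_jsonl_line(quad: Quad) -> str:
    """Serialise one quad as a single JSON line, dropping unset optional fields."""
    record = {k: v for k, v in asdict(quad).items() if v is not None}
    return json.dumps(record, ensure_ascii=False)


# Illustrative values only; the source ID and date are placeholders.
example = Quad(
    id="q-0001", s="Napoleon", p="led", o="Grande Armée",
    date="1804/1814", prov="ARG", source="src-hypothetical-001",
)
print(to_jsonl_line(example))
```

Rejecting a HIST quad without a cited_agent at construction time is one possible design; the same rule could instead be left to the validation pass.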

Normalisation is mandatory. Entity strings must be normalised to a canonical form before storage. Napoleon Bonaparte, Napoleon I, Napoleon, and Bonaparte should resolve to a single entity identifier. The LLM should maintain a normalisation table in the schema document and apply it consistently. Failures here fragment the entity space and corrupt embedding quality.
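
A normalisation table can be as simple as a mapping from known surface forms to one canonical identifier. The sketch below assumes a hypothetical ID format (ent:napoleon_i) and a case-insensitive, whitespace-insensitive lookup; neither is prescribed by the pattern.

```python
from typing import Dict, Optional

# Hypothetical normalisation table: lower-cased alias -> canonical entity ID.
ALIASES: Dict[str, str] = {
    "napoleon bonaparte": "ent:napoleon_i",
    "napoleon i": "ent:napoleon_i",
    "napoleon": "ent:napoleon_i",
    "bonaparte": "ent:napoleon_i",
}


def normalise(surface: str, table: Dict[str, str] = ALIASES) -> Optional[str]:
    """Return the canonical ID for a surface string, or None if it is unknown."""
    # Case-fold and collapse runs of whitespace before the lookup.
    key = " ".join(surface.casefold().split())
    return table.get(key)
```

Returning None for an unknown string, rather than minting an ID on the spot, leaves the decision to add a new canonical form as an explicit step.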

Predicate vocabulary must be closed and controlled. Open-ended predicate extraction produces an unmanageable long tail of near-synonymous relations that destroys embedding model performance. The schema must define a finite predicate set. A minimal starting set for historical/scholarly graphs:

argued            X argued [claim] / author-claim relation (use for ARG provenance)
attributed_to     Claim attributed to [agent] (use for HIST provenance)
caused            X caused Y
enabled           X enabled Y
occurred_at       X occurred at [location]
occurred_during   X occurred during [period]
succeeded         X succeeded Y (temporal sequence)
preceded          X preceded Y
led               X led [organisation/group]
participated_in   X participated in [event]
governed          X governed [territory/institution]
allied_with       X allied with Y (reciprocal)
opposed           X opposed Y
part_of           X is part of Y (structural)
instance_of       X is an instance of [type]
authored          X authored [work]
influenced        X influenced Y
reversed          X reversed [prior state/claim]

The vocabulary is extended, not replaced, as new domains require new relations. Extension decisions are logged in the schema with rationale.

Grammatical Sense and Semantic Direction

The sequence of (subject, predicate, object) must reflect the logical direction of the action, not the word order of the source text.

Subject as Agent: The subject must always be the entity performing the action or exerting the influence (the Agent).
Object as Patient: The object must be the entity receiving the action or being described (the Patient).
Passive-to-Active Transformation: The LLM must automatically convert passive-voice source sentences into active-voice quads. If the source says "The Edict was revoked by the King," the quad must be (King, reversed, Edict).
The S-P-O Logic Test: Before finalising a quad, the LLM must ask: "Is the Subject the one doing the [Predicate] to the Object?" If the answer is no, the roles must be swapped.

Operations

Extract. The LLM reads a source and generates quads. The recommended workflow:

Read the source. Identify its central argument. Record a précis in the source registry.
Extract ARG quads. For each major claim, ask: what entities are in relation, what is the relation, when did it obtain? Assign dates explicitly.
Extract HIST quads. For each claim the source attributes to prior scholarship, ask: who is the cited agent, what did they claim, what time does the claim concern? Assign dates to the knowledge claim's referent, not to the citation act.
Normalise entities against the existing normalisation table. Add new canonical forms where required.
Assign predicates from the controlled vocabulary. If no existing predicate fits, flag for schema review; do not invent an ad-hoc predicate.
Append to the quad store.
Log the extraction in the log file.

A single dense source may yield 50-200 quads across both provenance classes. The LLM should not attempt to be exhaustive on the first pass. Instead it should capture the most structurally significant claims. Marginal claims can be added in subsequent passes.

Validate. After extraction, run a validation pass:

Do all quads reference normalised entities?
Are all predicates from the controlled vocabulary?
Do all dates conform to the date schema?
Are all HIST quads populated with a cited_agent?
Are there entity strings in the new quads that appear to be near-duplicates of existing entities (suggesting a normalisation miss)?
Do all quads meet the check for grammatical sense and semantic direction?

Validation failures are corrected before the quad store is committed.
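
Several of these checks are mechanical. The following sketch covers three of them (controlled predicates, cited_agent on HIST quads, and near-duplicate entity strings) and leaves out the date and semantic-direction checks. The function name validate, the 0.8 similarity cutoff, and the use of difflib for near-duplicate detection are assumptions for illustration.

```python
from difflib import get_close_matches
from typing import Dict, List, Set

# The minimal starting vocabulary listed earlier in this document.
PREDICATES = {
    "argued", "attributed_to", "caused", "enabled", "occurred_at",
    "occurred_during", "succeeded", "preceded", "led", "participated_in",
    "governed", "allied_with", "opposed", "part_of", "instance_of",
    "authored", "influenced", "reversed",
}


def validate(quad: Dict[str, str], known_entities: Set[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the quad passed."""
    problems = []
    if quad.get("p") not in PREDICATES:
        problems.append(f"predicate not in vocabulary: {quad.get('p')!r}")
    if quad.get("prov") == "HIST" and not quad.get("cited_agent"):
        problems.append("HIST quad without cited_agent")
    for role in ("s", "o"):
        entity = quad.get(role, "")
        if entity and entity not in known_entities:
            # An unknown string that closely resembles a known entity
            # suggests a normalisation miss rather than a new entity.
            near = get_close_matches(entity, sorted(known_entities), n=1, cutoff=0.8)
            if near:
                problems.append(f"{role}={entity!r} may duplicate {near[0]!r}")
    return problems
```

String similarity is a crude signal: it will miss aliases that look nothing alike (Bonaparte, Napoleon I) and may flag literal object values, so flagged items are candidates for review, not automatic merges.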

Reconcile. Periodically, run a reconciliation pass across the quad store:

Identify ARG quads from different sources that assert incompatible relationships between the same entities in the same time period. Log these as explicit contradictions — they are not errors to be resolved but structured data about scholarly disagreement.
Identify HIST quads that attribute a claim to an agent, and check whether that agent's own source (if it exists in the registry) corroborates the attribution. Flag discrepancies.
Identify entities with very high quad counts — these are hubs. Verify their normalisation is consistent.
Identify NULL-dated quads where other quads in the store could constrain the date by inference.

Reconciliation outputs are appended to the log and, where they modify existing quads, recorded with rationale.
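
Two of these reconciliation checks can be sketched in a few lines. The example below assumes that "incompatible" is defined by a hand-maintained set of mutually exclusive predicate pairs (the INCOMPATIBLE set is hypothetical) and that "same time period" means an identical date string; a real pass would need interval overlap and a richer notion of incompatibility.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical pairs of predicates treated as mutually exclusive.
INCOMPATIBLE = {
    frozenset(("allied_with", "opposed")),
    frozenset(("succeeded", "preceded")),
}


def find_contradictions(quads: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Pair up ARG quads from different sources that clash on the same (s, o, date)."""
    groups = defaultdict(list)
    for q in quads:
        if q["prov"] == "ARG":
            groups[(q["s"], q["o"], q["date"])].append(q)
    clashes = []
    for group in groups.values():
        for a, b in combinations(group, 2):
            if a["source"] != b["source"] and frozenset((a["p"], b["p"])) in INCOMPATIBLE:
                clashes.append((a["id"], b["id"]))
    return clashes


def hubs(quads: List[Dict[str, str]], min_count: int = 50) -> List[Tuple[str, int]]:
    """Entities that appear as subject or object in at least min_count quads."""
    counts = Counter()
    for q in quads:
        counts[q["s"]] += 1
        counts[q["o"]] += 1
    return [(e, n) for e, n in counts.most_common() if n >= min_count]
```

The returned ID pairs are meant to be logged as contradictions, in keeping with the rule that scholarly disagreement is recorded rather than resolved.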

Prepare for embedding. Before passing the quad store to a KGE pipeline, a preparation pass is required:

Split into temporal and non-temporal subsets (the KGE method may differ between them).
Decide how to handle ARG vs HIST provenance. Options include: train separate embeddings, include provenance as a relation modifier, or train on both with a provenance-type auxiliary task.
Decide how to handle NULL dates. Typically these are included in non-temporal training but excluded from temporal evaluation benchmarks.
Compile entity and relation dictionaries and map all strings to integer indices.
Produce train/validation/test splits. For temporally-grounded graphs, splits should respect temporal ordering: where possible, test quads should come from later periods than training quads.
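
The dictionary-building and splitting steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names are hypothetical, a quad is ordered by the first four-digit year in its date, and CENT forms are treated like NULL and routed to the non-temporal subset. A validation split is omitted for brevity.

```python
import re
from typing import Dict, Iterable, List, Optional


def start_year(date: str) -> Optional[int]:
    """First four-digit year in a schema date, or None (NULL and CENT forms)."""
    match = re.search(r"\d{4}", date)
    return int(match.group()) if match else None


def build_index(items: Iterable[str]) -> Dict[str, int]:
    """Map each distinct string to a stable integer index."""
    return {item: i for i, item in enumerate(sorted(set(items)))}


def prepare(quads: List[Dict[str, str]], test_fraction: float = 0.2) -> dict:
    ent2id = build_index(x for q in quads for x in (q["s"], q["o"]))
    rel2id = build_index(q["p"] for q in quads)

    def encode(q):
        return (ent2id[q["s"]], rel2id[q["p"]], ent2id[q["o"]], q["date"])

    dated = sorted(
        (q for q in quads if start_year(q["date"]) is not None),
        key=lambda q: start_year(q["date"]),
    )
    undated = [q for q in quads if start_year(q["date"]) is None]
    # The latest quads form the test set, so test never precedes train.
    cut = len(dated) - int(len(dated) * test_fraction)
    return {
        "ent2id": ent2id,
        "rel2id": rel2id,
        "train": [encode(q) for q in dated[:cut]],
        "test": [encode(q) for q in dated[cut:]],
        "non_temporal": [encode(q) for q in undated],
    }
```

A strictly chronological cut can leave test entities that never occur in training, which many KGE methods cannot score; whether to filter those out is a decision for the schema document.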

Argument–historiography interaction patterns

Several common patterns require specific handling:

The reversal. A source argues (ARG) that a historiographical consensus (HIST) is wrong. Extract both: the HIST quad capturing what the consensus holds, and the ARG quad capturing the source's counter-claim. Add an ARG quad with predicate reversed linking the source's argument to the prior consensus claim.

The synthesis. A source integrates multiple prior claims into a new position. The prior claims are HIST; the integration itself is ARG. Do not merge them into a single quad.

The attribution chain. A source cites Scholar B citing Scholar C. Extract the outermost attribution as HIST with cited_agent = B. If the content of C's claim is also stated, extract it as a separate HIST quad with cited_agent = C. Flag the chain relationship in the note field.
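
One possible reading of the attribution chain, sketched in code: the same claim yields two HIST quads that differ in cited_agent, with the chain recorded in note. The function name, the ID arguments, and the wording of the notes are assumptions; the pattern does not fix them.

```python
from typing import Dict, List


def attribution_chain(
    claim: Dict[str, str],
    source: str,
    agent_b: str,
    agent_c: str,
    id_b: str,
    id_c: str,
) -> List[Dict[str, str]]:
    """Build two HIST quads for 'source cites B citing C' (illustrative only).

    claim holds the shared s, p, o and date fields.
    """
    outer = dict(
        claim, id=id_b, prov="HIST", source=source, cited_agent=agent_b,
        note=f"attribution chain: {agent_b} cites {agent_c}; see {id_c}",
    )
    inner = dict(
        claim, id=id_c, prov="HIST", source=source, cited_agent=agent_c,
        note=f"attribution chain: reported via {agent_b}; see {id_b}",
    )
    return [outer, inner]
```

The inner quad should only be emitted when the source actually states the content of C's claim; otherwise only the outer quad applies.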

The implicit argument. A source advances a claim without explicit argumentation — it is simply asserted as fact. This is still ARG provenance; the absence of explicit argumentation does not make it HIST. HIST is reserved for claims attributed to a named or characterised prior party.

The contested date. The source itself disputes a conventional date for an event. Extract the conventional date as a HIST quad (attributed to the prior tradition), and the proposed date as an ARG quad. The contradiction is structural information.

Indexing and logging

entity-index.md — a catalogue of all normalised entities in the quad store, each with its canonical string, any known aliases, entity type, and a count of quads in which it participates. Updated on each extraction pass. The LLM reads this before normalisation to check for existing canonical forms.

predicate-index.md — the current controlled vocabulary with definitions and example quads for each predicate. Extended through the schema review process.

log.md — append-only chronological record. Each entry: timestamp, operation type (extract / validate / reconcile / schema-update), source identifier if applicable, count of quads added or modified, and a brief summary. Recommended prefix format: ## [YYYY-MM-DD] extract | source-id | N quads added.
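
A small helper can keep log entries in the recommended prefix format. This is a sketch only: the function names and the choice of "added" for extract operations versus "modified" for everything else are assumptions.

```python
from datetime import date
from typing import Optional


def log_entry(
    operation: str,
    source_id: str,
    n_quads: int,
    summary: str,
    when: Optional[date] = None,
) -> str:
    """Format one log.md entry using the recommended heading prefix."""
    when = when or date.today()
    verb = "added" if operation == "extract" else "modified"
    header = f"## [{when.isoformat()}] {operation} | {source_id} | {n_quads} quads {verb}"
    return f"{header}\n\n{summary}\n"


def append_log(path: str, entry: str) -> None:
    """Append an entry to the log file; existing content is never rewritten."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
```

Opening the file in append mode is what keeps the log append-only at the file level; it does not stop someone editing the file by hand.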

Why the provenance split is load-bearing for embedding models

Knowledge graph embedding methods (TransE, RotatE, ComplEx, TComplEx, and their successors) learn entity and relation representations by fitting a scoring function to observed triples (or quads). The training signal is the set of facts provided. If HIST quads, which represent what one scholar said another scholar argued, are indistinguishable from ARG quads, the embedding model will represent epistemic attribution as ontological fact. The entity representations will be corrupted by conflating claims about the past with claims about what scholars have said about the past.

Keeping provenance explicit allows three downstream strategies. The simplest is filtering: train only on ARG quads and use HIST quads for evaluation or auxiliary tasks. The second is reification: add a provenance relation and make the quad's provenance class part of the relational structure. The third is multi-task training: use provenance class as a secondary classification signal alongside the primary link-prediction objective. Which strategy is appropriate depends on the research question. The pattern makes all three available; an undifferentiated extraction forecloses two of them.
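
The first two strategies can be illustrated directly on quad records. The sketch below is one possible rendering, not a prescribed one: the statement-node prefix stmt: and the relation names has_subject, has_predicate, has_object, has_date, and has_provenance are hypothetical, and other reification schemes would serve equally well.

```python
from typing import Dict, List, Tuple


def filter_arg(quads: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Filtering strategy: keep only the source's own claims for training."""
    return [q for q in quads if q["prov"] == "ARG"]


def reify(quad: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Reification strategy: turn one quad into triples about a statement node."""
    stmt = f"stmt:{quad['id']}"  # hypothetical statement-node ID
    triples = [
        (stmt, "has_subject", quad["s"]),
        (stmt, "has_predicate", quad["p"]),
        (stmt, "has_object", quad["o"]),
        (stmt, "has_date", quad["date"]),
        (stmt, "has_provenance", quad["prov"]),
    ]
    if quad.get("cited_agent"):
        triples.append((stmt, "attributed_to", quad["cited_agent"]))
    return triples
```

Reification multiplies the number of triples and introduces statement nodes as entities, which changes what the embedding space represents; that trade-off is part of choosing between the strategies.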

Notes

This document describes the pattern, not a specific implementation. Date schemas, predicate vocabularies, serialisation formats, and embedding pipeline choices are all domain-dependent. The schema document, maintained alongside the quad store, is where these choices are recorded and versioned. The LLM and the researcher co-evolve it. This document's only function is to communicate the structural commitments that make the pattern coherent: the provenance split, the temporal grounding, the controlled predicate vocabulary, and the normalisation discipline. These are not optional. Everything else is configuration.

shawngraham/LLM-KG-Extractor_Pattern.md
