A data-driven methodology for preparing HL7v2-to-FHIR mappings. Core principle: look at the data first, then read the spec.
- Input: directory with raw HL7v2 messages (.txt / .hl7)
- Target: FHIR R4
Understand what you're dealing with before parsing anything.
- What sending applications / facilities? (MSH.3, MSH.4)
- What message types? (ADT^A01, ORM^O01, ...)
- What HL7 versions? (2.3, 2.5, 2.5.1)
- What segments appear and in what combinations?
- Volume: how many messages, how many unique patients?
- Any encryption / obfuscation / masking?
Output: inventory.md — high-level summary of the data landscape.
For each segment (PID, PV1, NK1, OBX, DG1, ...):
- Parse all fields into a flat CSV (one row per segment occurrence)
- For each field calculate:
- Fill rate (% non-empty)
- Unique value count
- Top-N values (N=5-10)
- Data type (from HL7v2 spec)
- Flag anomalies: encrypted fields, unexpected values, encoding issues
Output: {segment}.csv + segment-profiles.md
Tool: Bun + DuckDB script. Parse messages → CSV → DuckDB for analytics → Markdown report.
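The per-field profiling described above can be sketched in plain TypeScript before DuckDB enters the picture. A minimal illustration, assuming the default `|` field separator and ignoring MSH's special encoding-character handling; the `profileSegment` name and `FieldStats` shape are my own, not from the spec:

```typescript
// Minimal field profiler: takes all occurrences of one segment type,
// returns fill rate, unique count, and top-5 values per field.
type FieldStats = { fillRate: number; unique: number; top: [string, number][] };

function profileSegment(occurrences: string[]): FieldStats[] {
  // Drop the segment name (index 0), so stats[i] describes field i+1
  const rows = occurrences.map((line) => line.split("|").slice(1));
  const width = Math.max(...rows.map((r) => r.length));
  const stats: FieldStats[] = [];
  for (let i = 0; i < width; i++) {
    const values = rows.map((r) => r[i] ?? "");
    const nonEmpty = values.filter((v) => v !== "");
    const counts = new Map<string, number>();
    for (const v of nonEmpty) counts.set(v, (counts.get(v) ?? 0) + 1);
    stats.push({
      fillRate: nonEmpty.length / values.length,
      unique: counts.size,
      top: [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 5),
    });
  }
  return stats;
}
```

Writing these stats out as CSV and loading them into DuckDB for cross-segment queries is then a small additional step.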
For each populated field with a composite data type:
| Data Type | Components | Used In |
|---|---|---|
| CX | ID, check digit, authority, type, facility | PID.2, PID.3, PID.4, PID.18 |
| XPN | Family, given, middle, suffix, prefix, degree, type | PID.5, PID.6, PID.9 |
| XAD | Street, city, state, zip, country, type | PID.11 |
| XTN | Number, use code, equipment type, email | PID.13, PID.14 |
| CE/CWE | Code, text, coding system | PID.10, PID.15, PID.16, PID.22 |
| XCN | ID, family, given, degree, authority | PV1.7, PV1.8, PV1.9, PV1.17 |
| PL | Point of care, room, bed, facility | PV1.3, PV1.6 |
| FC | Financial class code, effective date | PV1.20 |
For each:
- Split into components and subcomponents (^ and & separators)
- Handle repeats (~ separator)
- Profile each component (fill rate, values)
- Cross-tabulate where meaningful (e.g., CX: type x authority matrix)
Output: {segment}_{field}_{type}.csv + {field}-report.md
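The component/repeat splitting above is mechanical with the default HL7v2 separators (`~` repeats, `^` components, `&` subcomponents). A small sketch; the `splitField` name and nested-array shape are illustrative choices, and escape sequences are deliberately ignored here:

```typescript
// One repeat of a composite field: array of components,
// each component split into its subcomponents.
type Composite = string[][];

function splitField(raw: string): Composite[] {
  return raw
    .split("~")                                  // field repeats
    .map((rep) => rep.split("^")                 // components
      .map((comp) => comp.split("&")));          // subcomponents
}

// Example: PID.3 with two repeated CX identifiers
const reps = splitField("123^^^HOSP&1.2.3&ISO^MR~456^^^HOSP^PI");
// reps[0][3] → ["HOSP", "1.2.3", "ISO"]  (CX.4 assigning authority)
// reps[1][4] → ["PI"]                    (CX.5 identifier type code)
```

Profiling each component is then the same fill-rate exercise as for whole fields, applied one level down.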
For each segment-to-resource mapping:
- Read FHIR spec — official v2-to-FHIR mapping tables (e.g., CX → Identifier)
- Read community guidance — chat.fhir.org for real-world edge cases
- Read HL7v2 spec — CH02A (data types), relevant chapter for segment definitions
- Confront spec with data — does the official mapping work with what we actually see?
- Decide edge cases — document each decision with rationale
Output: {field}-mapping.md with:
- Official spec mapping table
- Community insights
- Per-type-code mapping examples (JSON)
- Decision log for edge cases
- Write TypeScript functions:
  - `parse{Type}(raw: string)` — HL7v2 composite type → structured object
  - `map{Segment}To{Resource}(segment)` — segment → FHIR resource
- Run on real data, inspect output
- Validate against FHIR profiles
- Iterate
Output: mapping.ts + sample FHIR JSON output
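As a hedged example of the `parse{Type}` / map pattern, here is one possible shape for XPN (component order per HL7 v2.5 CH02A; the interface and function names are mine, and only the first few components are handled):

```typescript
// parseXpn: HL7v2 XPN composite → structured object
interface Xpn {
  family?: string;
  given?: string;
  middle?: string;
  suffix?: string;
  prefix?: string;
}

function parseXpn(raw: string): Xpn {
  const [family, given, middle, suffix, prefix] = raw.split("^");
  const clean = (v?: string) => (v ? v : undefined);  // drop empty strings
  return {
    family: clean(family), given: clean(given), middle: clean(middle),
    suffix: clean(suffix), prefix: clean(prefix),
  };
}

// XPN → FHIR HumanName (PID.5 → Patient.name); middle name folds into given[]
function mapXpnToHumanName(xpn: Xpn) {
  return {
    family: xpn.family,
    given: [xpn.given, xpn.middle].filter((g): g is string => !!g),
    prefix: xpn.prefix ? [xpn.prefix] : undefined,
    suffix: xpn.suffix ? [xpn.suffix] : undefined,
  };
}
```

Splitting parse and map keeps the HL7v2 quirks on one side and the FHIR shape on the other, which makes both halves testable against real data independently.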
The spec describes all possible fields. A real installation typically uses maybe 20% of them. Fill rate is the primary filter — don't map what isn't there.
| Fill Rate | Action |
|---|---|
| 0% | Skip |
| < 1% | Investigate — anomaly or rare but important? |
| 1-50% | Map, document when it's populated |
| > 50% | Map, this is core data |
| 100% | Map, required field |
CX, XPN, XAD each have their own mapping logic. Don't try to map everything simultaneously.
For each field there's a choice:
- Build `identifier.system` from CX.4 or CX.6?
- Is `assigner` needed?
- What URI scheme for systems?
Document each decision and why.
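To make the `identifier.system` decision concrete, here is one possible resolution sketched in TypeScript: prefer the CX.4 universal ID when it is an ISO OID, otherwise fall back to a local URI built from the namespace ID. The `urn:local:` scheme and all names here are illustrative assumptions, exactly the kind of choice the decision log should record:

```typescript
// identifierSystem: derive a FHIR Identifier.system from a CX field,
// given the component/subcomponent array shape (cx[3] is CX.4).
function identifierSystem(cx: string[][]): string | undefined {
  const [namespace, universalId, idType] = cx[3] ?? [];
  if (universalId && idType === "ISO") return `urn:oid:${universalId}`;
  if (namespace) return `urn:local:${namespace}`;  // assumed local scheme
  return undefined;
}

// Building the CX component array inline for the example:
const cx = "123^^^HOSP&1.2.3&ISO^MR".split("^").map((c) => c.split("&"));
// identifierSystem(cx) → "urn:oid:1.2.3"
```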
A mapping isn't done until you've run it on real messages and looked at the FHIR JSON.
PID --> Patient
.identifier <-- PID.3 (CX)
.name <-- PID.5 (XPN)
.birthDate <-- PID.7
.gender <-- PID.8
.address <-- PID.11 (XAD)
.telecom <-- PID.13, PID.14 (XTN)
.maritalStatus <-- PID.16 (CE)
.communication <-- PID.15 (CE)
.extension[race] <-- PID.10 (CE)
.extension[ethnicity] <-- PID.22 (CE)
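A skeleton of the PID → Patient table above, covering only a few fields; keeping the raw `|`-split array (segment name at index 0) means `fields[3]` is PID.3, so the code reads like the table. The gender table follows HL7 Table 0001 → FHIR AdministrativeGender, but everything else here is a simplified sketch, not a complete mapping:

```typescript
// mapPidToPatient: raw PID field array → minimal FHIR Patient
function mapPidToPatient(fields: string[]) {
  const genderMap: Record<string, string> = {
    M: "male", F: "female", O: "other", U: "unknown",
  };
  const [family, given] = (fields[5] ?? "").split("^");  // PID.5 (XPN)
  return {
    resourceType: "Patient",
    // PID.3 (CX, repeating): only CX.1 kept in this sketch
    identifier: (fields[3] ?? "").split("~").filter(Boolean)
      .map((cx) => ({ value: cx.split("^")[0] })),
    name: family ? [{ family, given: given ? [given] : undefined }] : undefined,
    // PID.7: YYYYMMDD[...] → FHIR date
    birthDate: fields[7]
      ? fields[7].slice(0, 8).replace(/(\d{4})(\d{2})(\d{2})/, "$1-$2-$3")
      : undefined,
    gender: genderMap[fields[8] ?? ""],  // PID.8
  };
}
```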
PV1 --> Encounter
.class <-- PV1.2
.location <-- PV1.3 (PL)
.participant <-- PV1.7, PV1.8, PV1.9, PV1.17 (XCN)
.period <-- PV1.44, PV1.45
.identifier <-- PV1.19 (CX)
NK1 --> RelatedPerson
.relationship <-- NK1.3 (CE)
.name <-- NK1.2 (XPN)
.telecom <-- NK1.5 (XTN)
DG1 --> Condition
.code <-- DG1.3 (CE)
.onsetDateTime <-- DG1.5
OBX --> Observation
.code <-- OBX.3 (CE)
.value[x] <-- OBX.5 (varies)
.status <-- OBX.11
IN1 --> Coverage
.payor <-- IN1.3 (CX)
.period <-- IN1.12, IN1.13
.subscriber <-- IN1.16 (XPN)
AL1 --> AllergyIntolerance
.code <-- AL1.3 (CE)
.type <-- AL1.2
MSH --> MessageHeader (metadata)
EVN --> Provenance (event metadata)
- Runtime: Bun (fast TypeScript execution)
- Analytics: DuckDB (in-memory SQL on CSV)
- FHIR reference: fhir-faq skill (spec search + chat.fhir.org)
- HL7v2 reference: HL7 v2.5 PDF spec (CH02A for data types, CH03 for ADT segments)
- Output format: Markdown reports + CSV data files