A data-driven methodology for preparing HL7v2-to-FHIR mappings. Core principle: look at the data first, then read the spec.
- Input: directory with raw HL7v2 messages (.txt / .hl7)
- Target: FHIR R4
Understand what you're dealing with before parsing anything.
- What sending applications / facilities? (MSH.3, MSH.4)
- What message types? (ADT^A01, ORM^O01, ...)
- What HL7 versions? (2.3, 2.5, 2.5.1)
- What segments appear and in what combinations?
- Volume: how many messages, how many unique patients?
- Any encryption / obfuscation / masking?
Output: inventory.md — high-level summary of the data landscape.
For each segment (PID, PV1, NK1, OBX, DG1, ...):
- Parse all fields into a flat CSV (one row per segment occurrence)
- For each field calculate:
- Fill rate (% non-empty)
- Unique value count
- Top-N values (N=5-10)
- Data type (from HL7v2 spec)
- Flag anomalies: encrypted fields, unexpected values, encoding issues
Output: {segment}.csv + segment-profiles.md
Tool: Bun + DuckDB script. Parse messages → CSV → DuckDB for analytics → Markdown report.
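The per-field profiling described above can be sketched in plain TypeScript before DuckDB enters the picture. A minimal illustration, assuming the default `|` field separator and ignoring MSH's special encoding-character handling; the `profileSegment` name and `FieldStats` shape are my own, not from the spec:

```typescript
// Minimal field profiler: takes all occurrences of one segment type,
// returns fill rate, unique count, and top-5 values per field.
type FieldStats = { fillRate: number; unique: number; top: [string, number][] };

function profileSegment(occurrences: string[]): FieldStats[] {
  // Drop the segment name (index 0), so stats[i] describes field i+1
  const rows = occurrences.map((line) => line.split("|").slice(1));
  const width = Math.max(...rows.map((r) => r.length));
  const stats: FieldStats[] = [];
  for (let i = 0; i < width; i++) {
    const values = rows.map((r) => r[i] ?? "");
    const nonEmpty = values.filter((v) => v !== "");
    const counts = new Map<string, number>();
    for (const v of nonEmpty) counts.set(v, (counts.get(v) ?? 0) + 1);
    stats.push({
      fillRate: nonEmpty.length / values.length,
      unique: counts.size,
      top: [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 5),
    });
  }
  return stats;
}
```

Writing these stats out as CSV and loading them into DuckDB for cross-segment queries is then a small additional step.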
For each populated field with a composite data type:
| Data Type | Components | Used In |
|---|---|---|
| CX | ID, check digit, authority, type, facility | PID.2, PID.3, PID.4, PID.18 |
| XPN | Family, given, middle, suffix, prefix, degree, type | PID.5, PID.6, PID.9 |
| XAD | Street, city, state, zip, country, type | PID.11 |
| XTN | Number, use code, equipment type, email | PID.13, PID.14 |
| CE/CWE | Code, text, coding system | PID.10, PID.15, PID.16, PID.22 |
| XCN | ID, family, given, degree, authority | PV1.7, PV1.8, PV1.9, PV1.17 |
| PL | Point of care, room, bed, facility | PV1.3, PV1.6 |
| FC | Financial class code, effective date | PV1.20 |
For each:
- Split into components and subcomponents (^ and & separators)
- Handle repeats (~ separator)
- Profile each component (fill rate, values)
- Cross-tabulate where meaningful (e.g., CX: type x authority matrix)
Output: {segment}_{field}_{type}.csv + {field}-report.md
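The component/repeat splitting above is mechanical with the default HL7v2 separators (`~` repeats, `^` components, `&` subcomponents). A small sketch; the `splitField` name and nested-array shape are illustrative choices, and escape sequences are deliberately ignored here:

```typescript
// One repeat of a composite field: array of components,
// each component split into its subcomponents.
type Composite = string[][];

function splitField(raw: string): Composite[] {
  return raw
    .split("~")                                  // field repeats
    .map((rep) => rep.split("^")                 // components
      .map((comp) => comp.split("&")));          // subcomponents
}

// Example: PID.3 with two repeated CX identifiers
const reps = splitField("123^^^HOSP&1.2.3&ISO^MR~456^^^HOSP^PI");
// reps[0][3] → ["HOSP", "1.2.3", "ISO"]  (CX.4 assigning authority)
// reps[1][4] → ["PI"]                    (CX.5 identifier type code)
```

Profiling each component is then the same fill-rate exercise as for whole fields, applied one level down.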
For each segment-to-resource mapping:
- Read FHIR spec — official v2-to-FHIR mapping tables (e.g., CX → Identifier)
- Read community guidance — chat.fhir.org for real-world edge cases
- Read HL7v2 spec — CH02A (data types), relevant chapter for segment definitions
- Confront spec with data — does the official mapping work with what we actually see?
- Decide edge cases — document each decision with rationale
Output: {field}-mapping.md with:
- Official spec mapping table
- Community insights
- Per-type-code mapping examples (JSON)
- Decision log for edge cases
- Write TypeScript functions:
  - `parse{Type}(raw: string)` — HL7v2 composite type → structured object
  - `map{Segment}To{Resource}(segment)` — segment → FHIR resource
- Run on real data, inspect output
- Validate against FHIR profiles
- Iterate
Output: mapping.ts + sample FHIR JSON output
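As a hedged example of the `parse{Type}` / map pattern, here is one possible shape for XPN (component order per HL7 v2.5 CH02A; the interface and function names are mine, and only the first few components are handled):

```typescript
// parseXpn: HL7v2 XPN composite → structured object
interface Xpn {
  family?: string;
  given?: string;
  middle?: string;
  suffix?: string;
  prefix?: string;
}

function parseXpn(raw: string): Xpn {
  const [family, given, middle, suffix, prefix] = raw.split("^");
  const clean = (v?: string) => (v ? v : undefined);  // drop empty strings
  return {
    family: clean(family), given: clean(given), middle: clean(middle),
    suffix: clean(suffix), prefix: clean(prefix),
  };
}

// XPN → FHIR HumanName (PID.5 → Patient.name); middle name folds into given[]
function mapXpnToHumanName(xpn: Xpn) {
  return {
    family: xpn.family,
    given: [xpn.given, xpn.middle].filter((g): g is string => !!g),
    prefix: xpn.prefix ? [xpn.prefix] : undefined,
    suffix: xpn.suffix ? [xpn.suffix] : undefined,
  };
}
```

Splitting parse and map keeps the HL7v2 quirks on one side and the FHIR shape on the other, which makes both halves testable against real data independently.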
The spec describes all possible fields. A real installation typically uses maybe 20% of them. Fill rate is the primary filter — don't map what isn't there.
| Fill Rate | Action |
|---|---|
| 0% | Skip |
| < 1% | Investigate — anomaly or rare but important? |
| 1-50% | Map, document when it's populated |
| > 50% | Map, this is core data |
| 100% | Map, required field |
CX, XPN, XAD each have their own mapping logic. Don't try to map everything simultaneously.
For each field there's a choice:
- Build `identifier.system` from CX.4 or CX.6?
- Is `assigner` needed?
- What URI scheme for systems?
Document each decision and why.
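To make the `identifier.system` decision concrete, here is one possible resolution sketched in TypeScript: prefer the CX.4 universal ID when it is an ISO OID, otherwise fall back to a local URI built from the namespace ID. The `urn:local:` scheme and all names here are illustrative assumptions, exactly the kind of choice the decision log should record:

```typescript
// identifierSystem: derive a FHIR Identifier.system from a CX field,
// given the component/subcomponent array shape (cx[3] is CX.4).
function identifierSystem(cx: string[][]): string | undefined {
  const [namespace, universalId, idType] = cx[3] ?? [];
  if (universalId && idType === "ISO") return `urn:oid:${universalId}`;
  if (namespace) return `urn:local:${namespace}`;  // assumed local scheme
  return undefined;
}

// Building the CX component array inline for the example:
const cx = "123^^^HOSP&1.2.3&ISO^MR".split("^").map((c) => c.split("&"));
// identifierSystem(cx) → "urn:oid:1.2.3"
```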
A mapping isn't done until you've run it on real messages and looked at the FHIR JSON.
PID --> Patient
.identifier <-- PID.3 (CX)
.name <-- PID.5 (XPN)
.birthDate <-- PID.7
.gender <-- PID.8
.address <-- PID.11 (XAD)
.telecom <-- PID.13, PID.14 (XTN)
.maritalStatus <-- PID.16 (CE)
.communication <-- PID.15 (CE)
.extension[race] <-- PID.10 (CE)
.extension[ethnicity] <-- PID.22 (CE)
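A skeleton of the PID → Patient table above, covering only a few fields; keeping the raw `|`-split array (segment name at index 0) means `fields[3]` is PID.3, so the code reads like the table. The gender table follows HL7 Table 0001 → FHIR AdministrativeGender, but everything else here is a simplified sketch, not a complete mapping:

```typescript
// mapPidToPatient: raw PID field array → minimal FHIR Patient
function mapPidToPatient(fields: string[]) {
  const genderMap: Record<string, string> = {
    M: "male", F: "female", O: "other", U: "unknown",
  };
  const [family, given] = (fields[5] ?? "").split("^");  // PID.5 (XPN)
  return {
    resourceType: "Patient",
    // PID.3 (CX, repeating): only CX.1 kept in this sketch
    identifier: (fields[3] ?? "").split("~").filter(Boolean)
      .map((cx) => ({ value: cx.split("^")[0] })),
    name: family ? [{ family, given: given ? [given] : undefined }] : undefined,
    // PID.7: YYYYMMDD[...] → FHIR date
    birthDate: fields[7]
      ? fields[7].slice(0, 8).replace(/(\d{4})(\d{2})(\d{2})/, "$1-$2-$3")
      : undefined,
    gender: genderMap[fields[8] ?? ""],  // PID.8
  };
}
```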
PV1 --> Encounter
.class <-- PV1.2
.location <-- PV1.3 (PL)
.participant <-- PV1.7, PV1.8, PV1.9, PV1.17 (XCN)
.period <-- PV1.44, PV1.45
.identifier <-- PV1.19 (CX)
NK1 --> RelatedPerson
.relationship <-- NK1.3 (CE)
.name <-- NK1.2 (XPN)
.telecom <-- NK1.5 (XTN)
DG1 --> Condition
.code <-- DG1.3 (CE)
.onsetDateTime <-- DG1.5
OBX --> Observation
.code <-- OBX.3 (CE)
.value[x] <-- OBX.5 (varies)
.status <-- OBX.11
IN1 --> Coverage
.payor <-- IN1.3 (CX)
.period <-- IN1.12, IN1.13
.subscriber <-- IN1.16 (XPN)
AL1 --> AllergyIntolerance
.code <-- AL1.3 (CE)
.type <-- AL1.2
MSH --> MessageHeader (metadata)
EVN --> Provenance (event metadata)
- Runtime: Bun (fast TypeScript execution)
- Analytics: DuckDB (in-memory SQL on CSV)
- FHIR reference: fhir-faq skill (spec search + chat.fhir.org)
- HL7v2 reference: HL7 v2.5 PDF spec (CH02A for data types, CH03 for ADT segments)
- Output format: Markdown reports + CSV data files