@niquola
Created March 6, 2026 12:23

HL7v2 Mapping Preparation — Methodology

A data-driven methodology for preparing HL7v2-to-FHIR mappings. Core principle: look at the data first, then read the spec.

Input

  • Directory with raw HL7v2 messages (.txt / .hl7)
  • Target: FHIR R4

Phases

Phase 1: INVENTORY

Understand what you're dealing with before parsing anything.

  • What sending applications / facilities? (MSH.3, MSH.4)
  • What message types? (ADT^A01, ORM^O01, ...)
  • What HL7 versions? (2.3, 2.5, 2.5.1)
  • What segments appear and in what combinations?
  • Volume: how many messages, how many unique patients?
  • Any encryption / obfuscation / masking?

Output: inventory.md — high-level summary of the data landscape.
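
The inventory questions above can be answered with a small tallying pass over the raw messages. A minimal sketch: MSH field positions follow the rule that MSH.1 is the field separator itself, so after splitting the MSH line on "|", MSH.n sits at index n-1.

```typescript
// Tally sending app/facility (MSH.3/MSH.4), message type (MSH.9) and
// version (MSH.12) across raw messages. After splitting the MSH line on
// "|", MSH.n sits at index n-1 because MSH.1 is the separator itself.
type Counts = Record<string, number>;

const tally = (c: Counts, key: string) => { c[key] = (c[key] ?? 0) + 1; };

export function inventory(messages: string[]) {
  const apps: Counts = {};
  const types: Counts = {};
  const versions: Counts = {};
  for (const msg of messages) {
    const msh = msg.split(/\r\n?|\n/).find((s) => s.startsWith("MSH"));
    if (!msh) continue;
    const f = msh.split("|");
    tally(apps, `${f[2] ?? ""}/${f[3] ?? ""}`); // MSH.3 / MSH.4
    tally(types, f[8] ?? "");                   // MSH.9, e.g. ADT^A01
    tally(versions, f[11] ?? "");               // MSH.12
  }
  return { apps, types, versions };
}
```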

Phase 2: SEGMENT PROFILING

For each segment (PID, PV1, NK1, OBX, DG1, ...):

  • Parse all fields into a flat CSV (one row per segment occurrence)
  • For each field, calculate:
    • Fill rate (% non-empty)
    • Unique value count
    • Top-N values (N=5-10)
    • Data type (from HL7v2 spec)
  • Flag anomalies: encrypted fields, unexpected values, encoding issues

Output: {segment}.csv + segment-profiles.md

Tool: Bun + DuckDB script. Parse messages → CSV → DuckDB for analytics → Markdown report.
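
The per-field statistics can be sketched in TypeScript before handing the CSV to DuckDB. `profileField` is a hypothetical helper, not part of any library: it takes one string[] per segment occurrence (fields already split on "|") and computes fill rate, unique count, and top-N.

```typescript
// Profile one field across all occurrences of a segment. `rows` holds
// one string[] per occurrence, already split on "|", so HL7v2 field
// numbering lines up: PID.3 is row[3] (row[0] is the segment name).
export interface FieldProfile {
  fillRate: number;        // non-empty fraction, 0..1
  unique: number;          // distinct non-empty values
  top: [string, number][]; // top-N values with counts
}

export function profileField(rows: string[][], field: number, n = 5): FieldProfile {
  const counts = new Map<string, number>();
  let filled = 0;
  for (const row of rows) {
    const v = row[field] ?? "";
    if (v === "") continue;
    filled++;
    counts.set(v, (counts.get(v) ?? 0) + 1);
  }
  const top = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, n);
  return { fillRate: rows.length ? filled / rows.length : 0, unique: counts.size, top };
}
```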

Phase 3: COMPOSITE TYPE DECOMPOSITION

For each populated field with a composite data type:

Data Type   Components                                            Used In
CX          ID, check digit, authority, type, facility            PID.2, PID.3, PID.4, PID.18
XPN         Family, given, middle, suffix, prefix, degree, type   PID.5, PID.6, PID.9
XAD         Street, city, state, zip, country, type               PID.11
XTN         Number, use code, equipment type, email               PID.13, PID.14
CE/CWE      Code, text, coding system                             PID.10, PID.15, PID.16, PID.22
XCN         ID, family, given, degree, authority                  PV1.7, PV1.8, PV1.9, PV1.17
PL          Point of care, room, bed, facility                    PV1.3, PV1.6
FC          Financial class code, effective date                  PV1.20

For each:

  • Split into components and subcomponents (^ and & separators)
  • Handle repeats (~ separator)
  • Profile each component (fill rate, values)
  • Cross-tabulate where meaningful (e.g., CX: type x authority matrix)

Output: {segment}_{field}_{type}.csv + {field}-report.md
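
The splitting rules above fit in a few lines, assuming the default encoding characters (a full parser would read them from MSH.2):

```typescript
// Split a raw field into repeats (~), components (^), and
// subcomponents (&). Result shape: value[repeat][component][subcomponent].
// Assumes the default encoding characters from MSH.2.
export function decompose(raw: string): string[][][] {
  return raw
    .split("~")
    .map((rep) => rep.split("^").map((comp) => comp.split("&")));
}
```

For example, a CX like `12345^^^HOSP&1.2.3&ISO^MR` decomposes so that the assigning authority's universal ID lands at `[0][3][1]`, ready for component-level profiling.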

Phase 4: MAPPING RESEARCH

For each segment-to-resource mapping:

  1. Read FHIR spec — official v2-to-FHIR mapping tables (e.g., CX → Identifier)
  2. Read community guidance — chat.fhir.org for real-world edge cases
  3. Read HL7v2 spec — CH02A (data types), relevant chapter for segment definitions
  4. Confront spec with data — does the official mapping work with what we actually see?
  5. Decide edge cases — document each decision with rationale

Output: {field}-mapping.md with:

  • Official spec mapping table
  • Community insights
  • Per-type-code mapping examples (JSON)
  • Decision log for edge cases
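
As an illustration of the CX.4-vs-CX.6 decision, one possible CX → Identifier mapping. The v2-0203 type system and the `urn:oid:` prefix come from the spec; the fallback namespace URI scheme (`example.org`) is an assumption standing in for a project decision.

```typescript
// One possible CX → Identifier mapping, a sketch rather than the
// official ConceptMap. Input shape: components x subcomponents, as
// produced by a ^/& split of the raw CX field.
interface FhirIdentifier {
  type?: { coding: { system: string; code: string }[] };
  system?: string;
  value?: string;
}

export function cxToIdentifier(cx: string[][]): FhirIdentifier {
  const id = cx[0]?.[0];                                         // CX.1
  const [namespace, universalId, universalIdType] = cx[3] ?? []; // CX.4
  const typeCode = cx[4]?.[0];                                   // CX.5
  const system =
    universalIdType === "ISO" && universalId
      ? `urn:oid:${universalId}`
      : namespace
        ? `https://example.org/identifiers/${namespace.toLowerCase()}` // assumed scheme
        : undefined;
  return {
    ...(typeCode
      ? { type: { coding: [{ system: "http://terminology.hl7.org/CodeSystem/v2-0203", code: typeCode }] } }
      : {}),
    ...(system ? { system } : {}),
    ...(id ? { value: id } : {}),
  };
}
```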

Phase 5: MAPPING CODE

  1. Write TypeScript functions:
    • parse{Type}(raw: string) — HL7v2 composite type → structured object
    • map{Segment}To{Resource}(segment) — segment → FHIR resource
  2. Run on real data, inspect output
  3. Validate against FHIR profiles
  4. Iterate

Output: mapping.ts + sample FHIR JSON output
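
A sketch of what the parse{Type} / map{Segment}To{Resource} pair might look like, shown for XPN → HumanName. Component positions follow the XPN definition (1 family, 2 given, 3 middle, 4 suffix, 5 prefix, 7 name type code); the name-type-to-use table is partial and illustrative.

```typescript
// Phase 5 skeleton: composite-type parser plus segment mapper.
export interface HumanName {
  family?: string;
  given?: string[];
  prefix?: string[];
  suffix?: string[];
  use?: string;
}

// Illustrative subset of the v2 name-type → FHIR name-use mapping.
const NAME_USE: Record<string, string> = { L: "official", M: "maiden", N: "nickname" };

export function parseXPN(raw: string): HumanName {
  const c = raw.split("^");
  const use = NAME_USE[c[6] ?? ""];
  return {
    ...(c[0] ? { family: c[0] } : {}),
    ...(c[1] || c[2] ? { given: [c[1], c[2]].filter(Boolean) } : {}),
    ...(c[3] ? { suffix: [c[3]] } : {}),
    ...(c[4] ? { prefix: [c[4]] } : {}),
    ...(use ? { use } : {}),
  };
}

export function mapPIDToNames(pid: string[]): HumanName[] {
  // PID.5 may repeat (~); each repeat becomes one HumanName.
  return (pid[5] ?? "").split("~").filter(Boolean).map(parseXPN);
}
```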

Key Principles

1. Data before spec

The spec describes all possible fields. A real installation uses 20% of them. Fill rate is the primary filter — don't map what isn't there.

2. Fill rate drives priority

Fill Rate   Action
0%          Skip
< 1%        Investigate — anomaly or rare but important?
1-50%       Map, document when it's populated
> 50%       Map, this is core data
100%        Map, required field

3. One composite type at a time

CX, XPN, XAD each have their own mapping logic. Don't try to map everything simultaneously.

4. Mapping = decision, not translation

For each field there's a choice:

  • Build identifier.system from CX.4 or CX.6?
  • Is assigner needed?
  • What URI scheme for systems?

Document each decision and why.

5. Validate with real output

A mapping isn't done until you've run it on real messages and looked at the FHIR JSON.

Segment-to-Resource Map

PID  --> Patient
         .identifier  <-- PID.3 (CX)
         .name        <-- PID.5 (XPN)
         .birthDate   <-- PID.7
         .gender      <-- PID.8
         .address     <-- PID.11 (XAD)
         .telecom     <-- PID.13, PID.14 (XTN)
         .maritalStatus     <-- PID.16 (CE)
         .communication     <-- PID.15 (CE)
         .extension[race]   <-- PID.10 (CE)
         .extension[ethnicity] <-- PID.22 (CE)

PV1  --> Encounter
         .class       <-- PV1.2
         .location    <-- PV1.3 (PL)
         .participant <-- PV1.7, PV1.8, PV1.9, PV1.17 (XCN)
         .period      <-- PV1.44, PV1.45
         .identifier  <-- PV1.19 (CX)

NK1  --> RelatedPerson
         .relationship <-- NK1.3 (CE)
         .name         <-- NK1.2 (XPN)
         .telecom      <-- NK1.5 (XTN)

DG1  --> Condition
         .code        <-- DG1.3 (CE)
         .onsetDateTime <-- DG1.5

OBX  --> Observation
         .code        <-- OBX.3 (CE)
         .value[x]    <-- OBX.5 (varies)
         .status      <-- OBX.11

IN1  --> Coverage
         .payor       <-- IN1.3 (CX)
         .period      <-- IN1.12, IN1.13
         .subscriber  <-- IN1.16 (XPN)

AL1  --> AllergyIntolerance
         .code        <-- AL1.3 (CE)
         .type        <-- AL1.2

MSH  --> MessageHeader (metadata)
EVN  --> Provenance (event metadata)
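
For OBX, the value[x] choice is driven by OBX.2 (value type). A minimal dispatch sketch for the common types; unit handling (OBX.6) and coding-system-to-URI translation are deliberately omitted, and the string fallback for unhandled types is an assumption, not a rule.

```typescript
// OBX.5 → value[x] dispatch keyed on OBX.2. The CE/CWE branch assumes
// the default code^text^system component order.
export function mapOBXValue(valueType: string, raw: string): Record<string, unknown> {
  switch (valueType) {
    case "NM":
      return { valueQuantity: { value: Number(raw) } }; // unit from OBX.6 would go here
    case "ST":
    case "TX":
      return { valueString: raw };
    case "CE":
    case "CWE": {
      const [code, text, system] = raw.split("^");
      // `system` is still the raw v2 coding-system id; real mappings
      // translate it to a canonical URI.
      return { valueCodeableConcept: { coding: [{ code, display: text, system }], text } };
    }
    default:
      // ED, SN, DT, TS, ... each need their own mapping decision.
      return { valueString: raw };
  }
}
```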

Tooling

  • Runtime: Bun (fast TypeScript execution)
  • Analytics: DuckDB (in-memory SQL on CSV)
  • FHIR reference: fhir-faq skill (spec search + chat.fhir.org)
  • HL7v2 reference: HL7 v2.5 PDF spec (CH02A for data types, CH03 for ADT segments)
  • Output format: Markdown reports + CSV data files