@evanpurkhiser
Created February 23, 2026 09:45
Patching a ZFS Snapshot GUID to Avoid Retransmitting 9.5TB

After a misconfigured pruning job destroyed the only common snapshot between my local and offsite ZFS pools, I was facing a month-long full retransmit over a 4MB/s link. Instead, I patched OpenZFS's zhack tool to rewrite a snapshot's on-disk GUID, re-establishing the incremental send/recv chain in seconds.

The Disaster

I have a documents pool (~9.6TB, raidz1) backed up to an offsite tank pool over Tailscale. The initial transfer of documents@initial to tank/backups/documents-backup@initial took about a month at 4MB/s. Once complete, the plan was to use zfs send -i for fast incremental backups.

I use zrepl for automated snapshot management. My zrepl config had a pruning rule that kept snapshots matching ^zrepl_.* and deleted everything else. I hadn't noticed the implication: my manually-created @initial snapshot didn't match that regex.

For five days, zrepl tried to destroy @initial and failed — the send script still had a hold on it. The moment the send job finished and released the hold, zrepl pruned @initial. Gone in an instant.

The offsite still had tank/backups/documents-backup@initial, but the local @initial was destroyed. Without a common snapshot, zfs send -i won't work. The pools were orphaned from each other.

Failed Recovery: Pool Rewind

My first instinct was to rewind the pool. ZFS keeps a ring of uberblocks (root-of-the-Merkle-tree pointers), and zpool import -T <txg> can import at a historical transaction group. I found TXG 45195808 in the uberblock ring — timestamped just 4 minutes before the deletion.

best uberblock found for spa documents. txg 45195808
label discarded as txg is too large (45197259 > 45195808)
using uberblock with txg=45195808
FAILED: unable to retrieve MOS config

No luck. ZFS is copy-on-write: the ~1,400 TXGs of pool activity since that uberblock had recycled the MOS (Meta Object Set) blocks it pointed to. The uberblock was there, but the metadata tree beneath it was gone.

The Idea: Fake a Common Snapshot

ZFS incremental send identifies the "from" snapshot solely by GUID — a random 64-bit integer assigned at snapshot creation. It doesn't check the name, creation time, or transaction group. The receive side just walks the snapshot chain comparing GUIDs:

// module/zfs/dmu_recv.c:464-475
while (obj != 0) {
    error = dsl_dataset_hold_obj(dp, obj, FTAG, &snap);
    if (dsl_dataset_phys(snap)->ds_guid == fromguid)
        break;
    obj = dsl_dataset_phys(snap)->ds_prev_snap_obj;
    dsl_dataset_rele(snap, FTAG);
}

So if I created matching @synced snapshots on both pools with identical content and then changed one's GUID to match the other, ZFS would treat them as the same snapshot. Incremental sends would just work.

The data was already identical — the offsite was a faithful zfs send | recv copy, and almost nothing had changed on the local side since the original transfer (the zrepl snapshots showed 0B deltas for the most recent days).

I created @synced on both sides and verified they matched: same object IDs, same file sizes, same directory structure. A full checksum comparison would run overnight to be sure.

Now I just needed to change a single 8-byte field on disk.

Why You Can't Just Edit Bytes

ZFS stores everything in a Merkle tree. The snapshot's ds_guid lives in a dsl_dataset_phys_t structure (320 bytes, stored as a dnode bonus buffer in the MOS). That bonus buffer is part of a dnode block, which is checksummed by its parent block pointer, which is checksummed by its parent, all the way up to the uberblock.

Changing one byte with dd would invalidate every checksum up the tree. The pool would be corrupt.

But ZFS's own transaction machinery handles this automatically. When you modify a buffer through dmu_buf_will_dirty() and commit the transaction, ZFS does copy-on-write at every level: allocates new blocks, computes new checksums, writes new block pointers, and atomically commits a new uberblock. Raidz parity is recomputed by the ZIO pipeline as part of the normal write path.

The trick is to go through ZFS's own code, not around it.

Patching zhack

OpenZFS ships a tool called zhack — a debugging utility that can write to pool metadata using libzpool. It's used for things like enabling feature flags and repairing labels. It already has the infrastructure for safe metadata writes via dsl_sync_task().

I added a dataset set-guid subcommand. The core is about 30 lines:

static int
zhack_set_guid_check(void *arg, dmu_tx_t *tx)
{
    zhack_set_guid_arg_t *sga = arg;
    dsl_pool_t *dp = dmu_tx_pool(tx);
    dsl_dataset_t *ds;
    int error;

    error = dsl_dataset_hold(dp, sga->snap_name, FTAG, &ds);
    if (error != 0)
        return (error);

    if (!ds->ds_is_snapshot) {
        dsl_dataset_rele(ds, FTAG);
        return (SET_ERROR(EINVAL));
    }

    dsl_dataset_rele(ds, FTAG);
    return (0);
}

static void
zhack_set_guid_sync(void *arg, dmu_tx_t *tx)
{
    zhack_set_guid_arg_t *sga = arg;
    dsl_pool_t *dp = dmu_tx_pool(tx);
    dsl_dataset_t *ds;

    VERIFY0(dsl_dataset_hold(dp, sga->snap_name, FTAG, &ds));

    dmu_buf_will_dirty(ds->ds_dbuf, tx);

    uint64_t old_guid = dsl_dataset_phys(ds)->ds_guid;
    dsl_dataset_phys(ds)->ds_guid = sga->new_guid;

    spa_history_log_internal(dp->dp_spa, "zhack set-guid", tx,
        "snap=%s old_guid=%llu new_guid=%llu",
        sga->snap_name,
        (u_longlong_t)old_guid,
        (u_longlong_t)sga->new_guid);

    dsl_dataset_rele(ds, FTAG);
}

The check function validates in open context (is it a snapshot?). The sync function runs in syncing context within a transaction group — it dirties the bonus buffer, writes the new GUID, and logs to spa history. dsl_sync_task() orchestrates both phases and ensures atomicity.

The wrapper handles argument parsing, prints before/after state, and verifies the change took effect:

$ zhack dataset set-guid tank \
    tank/backups/documents-backup@synced \
    3253422282232884670

Dataset:      tank/backups/documents-backup@synced
Current GUID: 744223836558325714
New GUID:     3253422282232884670
Verified GUID: 3253422282232884670

The full patch is ~160 lines of C, following the same patterns as zhack's existing feature enable and feature ref subcommands.

Before Running It: Source Code Review

This was going to run against the only offsite copy of 9.5TB of data. I wanted to be very sure. I cloned the OpenZFS repo and investigated five specific areas:

1. Is there a pool-wide GUID registry? No. There's no hash table, AVL tree, or lookup cache for snapshot GUIDs anywhere in the pool metadata. GUIDs are only compared by value during send/recv. Changing one won't leave a stale reference in some index.

2. Does dsl_dataset_hold() work inside dsl_sync_task()? Yes. This is exactly how dsl_dataset_rename_snapshot_check(), dsl_dataset_snapshot_sync_impl(), and dozens of other ZFS operations work. The config lock is held by the dsl_sync_task machinery.

3. Will checksums cascade correctly? Yes. dmu_buf_will_dirty() marks the buffer dirty, which cascades up through parent blocks via dbuf_dirty(). The ZFS write pipeline computes checksums at each level during txg sync. Raidz parity is handled by the ZIO layer, same as any normal write.

4. Are there double-close issues with spa lifecycle? No. Every zhack subcommand calls spa_close() before returning, then main() calls spa_export(). spa_close() decrements a reference count; spa_export() flushes and exports. Same pattern throughout zhack.

5. Will bookmarks break? Bookmarks created before the GUID change will have the old GUID and become stale for incremental sends. They fail with ENODEV (safe failure, no corruption). Not an issue for this use case — no bookmarks reference @synced.

Building on the Offsite

The offsite is an aarch64 Alpine Linux box. Building OpenZFS userspace tools on Alpine/musl has a couple of quirks:

  • libintl is not part of musl's libc — you need gettext-dev and LIBS="-lintl" at configure time
  • zfs_gitrev.h is a generated file — run make gitrev before the main build
apk add build-base autoconf automake libtool git linux-headers \
    libtirpc-dev openssl-dev zlib-dev util-linux-dev libaio-dev \
    attr-dev python3 gettext-dev

git clone --depth 1 --branch zhack-dataset-set-guid \
    https://github.com/evanpurkhiser/zfs.git /root/zfs-build

cd /root/zfs-build
./autogen.sh
./configure --with-config=user LIBS="-lintl"
make gitrev && make -j$(nproc) zhack

The Munge

# Get the local GUID
$ zfs get guid documents@synced
documents@synced  guid  3253422282232884670

# Export the offsite pool
$ zpool export tank

# Change the GUID
$ cd /root/zfs-build
$ ./zhack dataset set-guid tank \
    tank/backups/documents-backup@synced \
    3253422282232884670

Dataset:      tank/backups/documents-backup@synced
Current GUID: 744223836558325714
New GUID:     3253422282232884670
Verified GUID: 3253422282232884670

# Re-import
$ zpool import tank

Pool imported cleanly. zfs get guid confirmed the match on both sides.

The Test

$ zfs send -i documents@synced documents@test_incremental | \
    ssh root@offsite 'zfs recv tank/backups/documents-backup'

It worked. The incremental was recognized, the stream was accepted, and the offsite received the new snapshot. No errors, no corruption.

A month-long retransmission avoided with an 8-byte write.

Lessons

Fix the root cause. zrepl's pruning config was the real problem. A single keep rule now protects non-zrepl snapshots:

pruning:
  keep:
    # Go's RE2 regex engine has no negative lookahead; zrepl's regex
    # keep rule provides negate for this case instead
    - type: regex
      negate: true
      regex: "^zrepl_.*"

ZFS GUIDs are the sole identity for send/recv matching. Not the name, not the creation time, not the transaction group. Just a 64-bit random number in dsl_dataset_phys_t at offset 112. The code in dmu_recv.c is unambiguous about this.

Never modify ZFS metadata outside the transaction machinery. The Merkle tree makes direct byte patching essentially impossible. But dmu_buf_will_dirty() plus dmu_tx_commit() is the exact same codepath ZFS uses for every metadata write. It's not a hack; it's the normal way.

zhack is a powerful and underappreciated tool. It already links against libzpool and has the dsl_sync_task() infrastructure. Adding a new subcommand is straightforward if you follow the existing patterns.

The patch is at evanpurkhiser/zfs@zhack-dataset-set-guid if anyone else finds themselves in a similar situation.
