After a misconfigured pruning job destroyed the only common snapshot between my
local and offsite ZFS pools, I was facing a month-long full retransmit over a
4MB/s link. Instead, I patched OpenZFS's zhack tool to rewrite a snapshot's
on-disk GUID, re-establishing the incremental send/recv chain in seconds.
I have a documents pool (~9.6TB, raidz1) backed up to an offsite tank pool
over Tailscale. The initial transfer of documents@initial to
tank/backups/documents-backup@initial took about a month at 4MB/s. Once
complete, the plan was to use zfs send -i for fast incremental backups.
I use zrepl for automated snapshot management.
My zrepl config had a pruning rule that kept snapshots matching ^zrepl_.* and
deleted everything else. I hadn't noticed the implication: my manually-created
@initial snapshot didn't match that regex.
For five days, zrepl tried to destroy @initial and failed — the send script
still had a hold on it. The moment the send job finished and released the hold,
zrepl pruned @initial. Gone in an instant.
The offsite still had tank/backups/documents-backup@initial, but the local
@initial was destroyed. Without a common snapshot, zfs send -i won't work.
The pools were orphaned from each other.
My first instinct was to rewind the pool. ZFS keeps a ring of uberblocks
(root-of-the-Merkle-tree pointers), and zpool import -T <txg> can import at
a historical transaction group. I found TXG 45195808 in the uberblock ring —
timestamped just 4 minutes before the deletion.
```
best uberblock found for spa documents. txg 45195808
label discarded as txg is too large (45197259 > 45195808)
using uberblock with txg=45195808
FAILED: unable to retrieve MOS config
```
No luck. ZFS is copy-on-write: the ~1,400 TXGs of pool activity since that uberblock had recycled the MOS (Meta Object Set) blocks it pointed to. The uberblock was there, but the metadata tree beneath it was gone.
ZFS incremental send identifies the "from" snapshot solely by GUID — a random 64-bit integer assigned at snapshot creation. It doesn't check the name, creation time, or transaction group. The receive side just walks the snapshot chain comparing GUIDs:
```c
// module/zfs/dmu_recv.c:464-475
while (obj != 0) {
	error = dsl_dataset_hold_obj(dp, obj, FTAG, &snap);
	if (dsl_dataset_phys(snap)->ds_guid == fromguid)
		break;
	obj = dsl_dataset_phys(snap)->ds_prev_snap_obj;
	dsl_dataset_rele(snap, FTAG);
}
```

So if I created matching @synced snapshots on both pools with identical
content and then changed one's GUID to match the other, ZFS would treat them as
the same snapshot. Incremental sends would just work.
The data was already identical — the offsite was a faithful zfs send | recv
copy, and almost nothing had changed on the local side since the original
transfer (the zrepl snapshots showed 0B deltas for the most recent days).
I created @synced on both sides and verified they matched: same object IDs,
same file sizes, same directory structure. A full checksum comparison would run
overnight to be sure.
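A sketch of that structural comparison, assuming a hypothetical `compare_trees` helper (not part of the patch): it walks one tree and stats the corresponding path in the other, flagging any entry whose type or size differs.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ftw.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: compare tree `a` against tree `b`. Every path
 * under `a` must exist under `b` with the same file type and (for
 * regular files) the same size. Returns the number of mismatches. */
static const char *g_a, *g_b;
static int g_mismatches;

static int
visit(const char *path, const struct stat *sa, int type, struct FTW *f)
{
	char other[4096];
	struct stat sb;

	(void)type; (void)f;
	/* Map the path under `a` to its counterpart under `b`. */
	snprintf(other, sizeof (other), "%s%s", g_b, path + strlen(g_a));

	if (stat(other, &sb) != 0 ||
	    (sa->st_mode & S_IFMT) != (sb.st_mode & S_IFMT) ||
	    (S_ISREG(sa->st_mode) && sa->st_size != sb.st_size))
		g_mismatches++;
	return (0);
}

int
compare_trees(const char *a, const char *b)
{
	g_a = a; g_b = b; g_mismatches = 0;
	assert(nftw(a, visit, 16, FTW_PHYS) == 0);
	return (g_mismatches);
}
```

In practice this would run against the two mounted snapshots, something like `compare_trees("/documents/.zfs/snapshot/synced", "/mnt/tank/.../synced")` (illustrative paths).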
Now I just needed to change a single 8-byte field on disk.
ZFS stores everything in a Merkle tree. The snapshot's ds_guid lives in a
dsl_dataset_phys_t structure (320 bytes, stored as a dnode bonus buffer in the
MOS). That bonus buffer is part of a dnode block, which is checksummed by its
parent block pointer, which is checksummed by its parent, all the way up to
the uberblock.
Changing one byte with dd would invalidate every checksum up the tree. The
pool would be corrupt.
But ZFS's own transaction machinery handles this automatically. When you modify
a buffer through dmu_buf_will_dirty() and commit the transaction, ZFS does
copy-on-write at every level: allocates new blocks, computes new checksums,
writes new block pointers, and atomically commits a new uberblock. Raidz parity
is recomputed by the ZIO pipeline as part of the normal write path.
The trick is to go through ZFS's own code, not around it.
OpenZFS ships a tool called zhack — a debugging utility that can write to pool
metadata using libzpool. It's used for things like enabling feature flags and
repairing labels. It already has the infrastructure for safe metadata writes via
dsl_sync_task().
I added a dataset set-guid subcommand. The core is about 30 lines:
```c
static int
zhack_set_guid_check(void *arg, dmu_tx_t *tx)
{
	zhack_set_guid_arg_t *sga = arg;
	dsl_pool_t *dp = dmu_tx_pool(tx);
	dsl_dataset_t *ds;
	int error;

	error = dsl_dataset_hold(dp, sga->snap_name, FTAG, &ds);
	if (error != 0)
		return (error);
	if (!ds->ds_is_snapshot) {
		dsl_dataset_rele(ds, FTAG);
		return (SET_ERROR(EINVAL));
	}
	dsl_dataset_rele(ds, FTAG);
	return (0);
}

static void
zhack_set_guid_sync(void *arg, dmu_tx_t *tx)
{
	zhack_set_guid_arg_t *sga = arg;
	dsl_pool_t *dp = dmu_tx_pool(tx);
	dsl_dataset_t *ds;

	VERIFY0(dsl_dataset_hold(dp, sga->snap_name, FTAG, &ds));

	dmu_buf_will_dirty(ds->ds_dbuf, tx);
	uint64_t old_guid = dsl_dataset_phys(ds)->ds_guid;
	dsl_dataset_phys(ds)->ds_guid = sga->new_guid;

	spa_history_log_internal(dp->dp_spa, "zhack set-guid", tx,
	    "snap=%s old_guid=%llu new_guid=%llu",
	    sga->snap_name,
	    (u_longlong_t)old_guid,
	    (u_longlong_t)sga->new_guid);

	dsl_dataset_rele(ds, FTAG);
}
```

The check function validates in open context (is it a snapshot?). The sync
function runs in syncing context within a transaction group — it dirties the
bonus buffer, writes the new GUID, and logs to spa history. `dsl_sync_task()`
orchestrates both phases and ensures atomicity.
The wrapper handles argument parsing, prints before/after state, and verifies the change took effect:
```shell
$ zhack dataset set-guid tank \
    tank/backups/documents-backup@synced \
    3253422282232884670
Dataset: tank/backups/documents-backup@synced
Current GUID:  744223836558325714
New GUID:      3253422282232884670
Verified GUID: 3253422282232884670
```
The full patch is ~160 lines of C,
following the same patterns as zhack's existing feature enable and feature ref
subcommands.
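For context, the dispatch side of the patch is roughly this shape — a sketch, not the verbatim code. It assumes zhack's existing `zhack_spa_open` helper and passes the argument struct through `dsl_sync_task()`, which takes the pool name, the check and sync callbacks, the callback argument, an estimate of blocks modified, and a space-check policy:

```c
/* Sketch only: argument parsing and the before/after prints omitted.
 * zhack_set_guid_arg_t is the struct handed to the check/sync
 * callbacks shown above. */
typedef struct zhack_set_guid_arg {
	const char *snap_name;
	uint64_t    new_guid;
} zhack_set_guid_arg_t;

static int
zhack_do_dataset_set_guid(const char *target, const char *snap_name,
    uint64_t new_guid)
{
	spa_t *spa;
	zhack_set_guid_arg_t sga = {
		.snap_name = snap_name,
		.new_guid = new_guid,
	};
	int error;

	zhack_spa_open(target, B_FALSE, FTAG, &spa);

	/* Run the check in open context, then the sync function in
	 * syncing context inside a single txg. */
	error = dsl_sync_task(spa_name(spa),
	    zhack_set_guid_check, zhack_set_guid_sync,
	    &sga, 1, ZFS_SPACE_CHECK_NORMAL);

	spa_close(spa, FTAG);
	return (error);
}
```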
This was going to run against the only offsite copy of 9.5TB of data. I wanted to be very sure. I cloned the OpenZFS repo and investigated five specific areas:
1. Is there a pool-wide GUID registry? No. There's no hash table, AVL tree, or lookup cache for snapshot GUIDs anywhere in the pool metadata. GUIDs are only compared by value during send/recv. Changing one won't leave a stale reference in some index.
2. Does dsl_dataset_hold() work inside dsl_sync_task()? Yes. This is
exactly how dsl_dataset_rename_snapshot_check(),
dsl_dataset_snapshot_sync_impl(), and dozens of other ZFS operations work. The
config lock is held by the dsl_sync_task machinery.
3. Will checksums cascade correctly? Yes. dmu_buf_will_dirty() marks the
buffer dirty, which cascades up through parent blocks via dbuf_dirty(). The
ZFS write pipeline computes checksums at each level during txg sync. Raidz
parity is handled by the ZIO layer, same as any normal write.
4. Are there double-close issues with spa lifecycle? No. Every zhack
subcommand calls spa_close() before returning, then main() calls
spa_export(). spa_close() decrements a reference count; spa_export()
flushes and exports. Same pattern throughout zhack.
5. Will bookmarks break? Bookmarks created before the GUID change will
have the old GUID and become stale for incremental sends. They fail with
ENODEV (safe failure, no corruption). Not an issue for this use case — no
bookmarks reference @synced.
The offsite is an aarch64 Alpine Linux box. Building OpenZFS userspace tools on Alpine/musl has a couple of quirks:
- `libintl` is not part of musl's libc — you need `gettext-dev` and `LIBS="-lintl"` at configure time
- `zfs_gitrev.h` is a generated file — run `make gitrev` before the main build
```shell
apk add build-base autoconf automake libtool git linux-headers \
    libtirpc-dev openssl-dev zlib-dev util-linux-dev libaio-dev \
    attr-dev python3 gettext-dev

git clone --depth 1 --branch zhack-dataset-set-guid \
    https://github.com/evanpurkhiser/zfs.git /root/zfs-build
cd /root/zfs-build
./autogen.sh
./configure --with-config=user LIBS="-lintl"
make gitrev && make -j$(nproc) zhack
```

```shell
# Get the local GUID
$ zfs get guid documents@synced
documents@synced  guid  3253422282232884670

# Export the offsite pool
$ zpool export tank

# Change the GUID
$ cd /root/zfs-build
$ ./zhack dataset set-guid tank \
    tank/backups/documents-backup@synced \
    3253422282232884670
Dataset: tank/backups/documents-backup@synced
Current GUID:  744223836558325714
New GUID:      3253422282232884670
Verified GUID: 3253422282232884670

# Re-import
$ zpool import tank
```

Pool imported cleanly. `zfs get guid` confirmed the match on both sides.
```shell
$ zfs send -i documents@synced documents@test_incremental | \
    ssh root@offsite 'zfs recv tank/backups/documents-backup'
```

It worked. The incremental was recognized, the stream was accepted, and the offsite received the new snapshot. No errors, no corruption.
A month-long retransmission avoided with an 8-byte write.
Fix the root cause. zrepl's pruning config was the real problem. A single keep rule now protects non-zrepl snapshots:
```yaml
pruning:
  keep:
    - type: regex
      regex: "^(?!zrepl_).*"
```

ZFS GUIDs are the sole identity for send/recv matching. Not the name, not
the creation time, not the transaction group. Just a 64-bit random number in
`dsl_dataset_phys_t` at offset 112. The code in `dmu_recv.c` is unambiguous
about this.
Never modify ZFS metadata outside the transaction machinery. The Merkle
tree makes direct byte patching essentially impossible. But
`dmu_buf_will_dirty()` + `dmu_tx_commit()` is the exact same codepath ZFS uses
for every metadata write. It's not a hack — it's the normal way.
zhack is a powerful and underappreciated tool. It already links against
libzpool and has the dsl_sync_task() infrastructure. Adding a new
subcommand is straightforward if you follow the existing patterns.
The patch is at evanpurkhiser/zfs@zhack-dataset-set-guid if anyone else finds themselves in a similar situation.