Confidence: ~95 % that it’s just the filesystem, not the SSD
| Evidence | What it means |
|---|---|
| SMART: 0 media/data-integrity errors, 0 error log entries, 5 % wear | Drive hardware is behaving normally. |
| NVMe logs: no aborts, resets, or PCIe AER errors shown | No transport or controller faults. |
| Btrfs stats: all `*_io_errs` counters at 0 | The kernel never saw a write fail at the block layer. |
| Metadata chunk 93 % full at the RO flip | Btrfs is documented to abort the transaction and remount RO when it can’t reserve metadata space. |
A bad SSD almost always leaves fingerprints—CRC errors, aborted commands, or incrementing error logs—which you don’t have. The only smoking gun you do have is “metadata almost full”.
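If you want to double-check that headroom picture yourself before touching anything, `btrfs filesystem usage` breaks down allocated vs. used space per block-group type. A minimal check, assuming `/` is where the affected filesystem is mounted:

```bash
# Allocated vs. used space per block-group type; watch the Metadata line.
# "/" is assumed to be the affected Btrfs mount; adjust as needed.
sudo btrfs filesystem usage /
```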
To rule out the hardware anyway, run these three checks:

- Run the NVMe built-in long self-test (3–5 h, non-destructive):

  ```bash
  sudo smartctl -t long /dev/nvme0
  sudo smartctl -a /dev/nvme0   # after it finishes
  ```

  A failing controller will report errors here.
- Scrub after you’ve freed space and remounted RW:

  ```bash
  sudo btrfs scrub start -Bd /
  ```

  Scrub rereads every data/metadata block; it will surface silent corruption if any exists.
- Watch dmesg while running heavy I/O (e.g., fio or copying large files). Any hidden PCIe or NAND issue tends to show up within minutes under load; one way to set this up is sketched right after this list.
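A minimal sketch for that last check, assuming `fio` is installed, you’ve already freed some space, and `/mnt/testfile` is a hypothetical throwaway path on the Btrfs volume:

```bash
# Terminal 1: follow the kernel log with readable timestamps
sudo dmesg -wT

# Terminal 2: five minutes of sustained 4k random writes
# (/mnt/testfile is a placeholder; point it at a scratch file on the
#  Btrfs mount and delete it afterwards)
fio --name=nvme-stress --filename=/mnt/testfile --size=4G \
    --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=300 --time_based
```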
If those three come back clean, treat the incident as a pure Btrfs ENOSPC event.
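“Clean” here means, roughly: the self-test log shows no failures, the scrub reports zero errors, and no new I/O or PCIe errors appear in the kernel log under load. A quick way to check each, assuming `/` is the affected mount:

```bash
sudo smartctl -a /dev/nvme0          # check 1: self-test log should show no failures
sudo btrfs scrub status /            # check 2: scrub should report 0 errors
sudo dmesg --level=err,crit,alert    # check 3: no new I/O or PCIe errors under load
```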
- Firmware bugs can exist even in a “clean” drive, but they usually manifest as resets/aborts first, not silent RO flips.
- Bad RAM can corrupt metadata in flight; a full memtest is cheap insurance.
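For the RAM point, a bootable MemTest86+ run is the thorough option; as a quicker in-OS spot check, something like `memtester` works (assuming the package is installed; the size and iteration values below are arbitrary):

```bash
# Lock and test 2 GiB of RAM for 2 passes; not exhaustive, but it catches
# grossly bad modules without rebooting
sudo memtester 2048M 2
```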
All signs point to recoverable metadata ENOSPC, not dying hardware. Free a few GB, run the metadata-only balance, scrub, and keep at least 2 GiB metadata headroom. If any of the checks above lights up afterwards, reassess—but odds are heavily in your favor.
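For reference, the metadata-only balance mentioned above is typically run with a usage filter so that only partially filled metadata chunks get rewritten; the 50 % threshold below is a common starting point, not a required value:

```bash
# Rewrite metadata chunks that are at most 50 % used, packing them
# together and returning the freed chunks to the unallocated pool
sudo btrfs balance start -musage=50 /
```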