Phew, this was a real stress test for the system and the policies! Great job pushing through it. The results are highly insightful. Let's break it down.
Random read, QD1:
- `pid`: Very low IOPS (94), extreme latency outliers (>900ms).
- `round-robin`: Slightly higher IOPS (161), but worse latency outliers (>1.6s!). Indicates that including the slow disk (ID 7) is harmful here.
- `latency`: Significantly better (412 IOPS), reduced latency outliers (~400ms). Focuses on the fast devices.
- `latency-rr`: Outstanding! Highest IOPS (561), excellent latency percentiles (99.99th at only ~43ms!). Seems to distribute the load well to the right (fast) disks.
Random read, QD32:
- `pid`: Moderate (763 IOPS).
- `round-robin`: Worse than `pid` (540 IOPS). Clearly throttled by the slowest disk.
- `latency`: Good (1196 IOPS).
- `latency-rr`: Top performer! (1928 IOPS). Makes the best use of the fast disks for parallel random accesses.
Sequential read:
- `pid`: Weak (97 MiB/s).
- `round-robin`: Better (139 MiB/s), but far from the full potential.
- `latency`: Best (254 MiB/s). Likely focuses on the absolute fastest disk(s) for the sequential stream.
- `latency-rr`: Very good (220 MiB/s), but slightly behind `latency`. Possibly due to minor overhead from distributing the load even among the fast devices, which isn't optimal for purely sequential access.
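The three test shapes above map onto fio jobs roughly like the sketch below. The file path, size, and runtime are placeholders rather than the actual parameters of your runs, so treat this only as an illustration of the workload types, not the exact benchmark commands.

```bash
TESTFILE=/mnt/btrfs/fio-testfile   # placeholder path on the RAID1 filesystem

# Random read, queue depth 1
fio --name=randread-qd1 --filename="$TESTFILE" --size=8G \
    --rw=randread --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=1 --runtime=60 --time_based

# Random read, queue depth 32
fio --name=randread-qd32 --filename="$TESTFILE" --size=8G \
    --rw=randread --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=32 --runtime=60 --time_based

# Sequential read, large blocks
fio --name=seqread --filename="$TESTFILE" --size=8G \
    --rw=read --bs=1M --ioengine=libaio --direct=1 \
    --iodepth=8 --runtime=60 --time_based
```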
With defrag running in the background, the general impact is as expected: performance drops for all policies due to the additional defrag read/write load.
Random read, QD1 (with defrag):
- `pid` & `round-robin`: Total collapse (0–21 IOPS). Unusable.
- `latency`: Holds up decently (159 IOPS).
- `latency-rr`: Most robust (353 IOPS). Best at load balancing.
Random read, QD32 (with defrag):
- `pid` & `round-robin`: Very poor (85 and 56 IOPS!).
- `latency`: Much better (544 IOPS).
- `latency-rr`: Best again (630 IOPS).
Sequential read (with defrag):
- `pid` & `round-robin`: Poor (121–126 MiB/s).
- `latency` & `latency-rr`: Both very good and nearly identical (~235 MiB/s). They successfully isolate foreground reads from the background load by selecting the fastest/least busy devices.
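The defrag variants above can be approximated by starting a recursive defragment in the background and running the same fio jobs in the foreground. A minimal sketch, again with placeholder paths:

```bash
# Recursive defragment of the filesystem as competing background load.
sudo btrfs filesystem defragment -r /mnt/btrfs &
DEFRAG_PID=$!

# Foreground read benchmark while the defrag is active
# (QD1 shown; repeat for QD32 and the sequential job).
fio --name=randread-qd1-defrag --filename=/mnt/btrfs/fio-testfile --size=8G \
    --rw=randread --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=1 --runtime=60 --time_based

# Wait for the background defrag to finish afterwards.
wait "$DEFRAG_PID"
```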
Looking at the per-device statistics:
- `pid`: Distributes the load reasonably but quickly ignores devices (high `ignored` values) once they show latency. ID 7 (Hitachi) already shows >37ms `avg` after the QD1 test.
- `round-robin`: Uses all disks (low `ignored`, low `age`), but `avg` latency rises across the board because the slowest disk drags down performance. ID 7 hits >40ms `avg`.
- `latency`: Heavily ignores IDs 6 and 7 (`age` and `ignored` high). The load concentrates on IDs 1, 2, 8, and 4, resulting in good performance. ID 7 gets rehabilitated after the QD32 run (`count` increases, `checkpoint ios` > 0).
- `latency-rr`: Similar pattern to `latency`, but with better load distribution among IDs 1, 2, 8, and 4 (visible from the IOPS/BW figures). Rehabilitation also works here.
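The per-device behaviour described here can also be cross-checked from userspace while a test is running, for example with iostat: its `r/s`, `rkB/s`, and `r_await` columns roughly correspond to the per-device IOPS, bandwidth, and average read latency discussed above.

```bash
# Watch per-device read IOPS (r/s), bandwidth (rkB/s) and average read
# latency (r_await) once per second while a fio job is running.
iostat -x -d 1
```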
In summary:
- `latency-rr` is the clear winner: It offers the best all-round performance, especially for random I/O (latency at QD1 and IOPS at QD32), and shows the highest robustness under background load. The slight weakness in pure sequential reads compared to `latency` is negligible.
- The pure `latency` policy is good but inferior: Clearly better than `pid` and `round-robin`, but it doesn't quite match `latency-rr` for random I/O. It excels in maximum sequential throughput.
- `pid` and `round-robin` are unsuitable: Both policies are subpar for this heterogeneous setup and partly collapse under load. `round-robin` suffers greatly from the slowest disk.
- Your patch set works: The checkpoint-latency approach with age-based rehabilitation allows the latency-based policies to work effectively without freezing out devices permanently.
Based on these tests, `latency-rr` is the superior choice for your setup. It combines latency optimization with intelligent load balancing among the fast devices.
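For completeness, here is a sketch of how the policy would be switched at runtime, assuming the patch set plugs its new policies into the per-filesystem `read_policy` sysfs attribute that mainline already uses for `pid`. The mount point is a placeholder, and the accepted policy names depend on the patched kernel.

```bash
# Look up the filesystem UUID from its mount point (placeholder path).
FSID=$(findmnt -no UUID /mnt/btrfs)

# Show the currently selected read policy.
cat /sys/fs/btrfs/"$FSID"/read_policy

# Switch to latency-rr (only valid with the patch set applied).
echo latency-rr | sudo tee /sys/fs/btrfs/"$FSID"/read_policy
```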
The next step would be to implement the `queue` policy and see if it can outperform `latency-rr`, especially under defrag conditions where utilization might play a bigger role. But for now, `latency-rr` has set a high bar!
@kakra I have run additional benchmarks on HDD RAID1 and an SSD RAID10, with 1 job and 4 jobs respectively, and with and without defrag running in the background. Here is the raw data set: https://paste.tnonline.net/files/dKW2L4gLUVbO_btrfs_read_policy_SSD_HDD.zip
Note: the tables below were created by AI from the raw data, so they may contain errors.
Btrfs Read Policy Results (Defrag Off)
Btrfs Read Policy Results (Defrag On)
Across all four benchmark sets—including HDD RAID 1 and SSD RAID 10, single- vs. multi-threaded, and with/without background defragmentation—the queue policy consistently delivers the best balance of throughput and latency, especially under random-read workloads and when defrag is running. Here’s a high-level breakdown:
1. `queue`
Strengths:
- Highest random IOPS and bandwidth on both HDD and SSD.
- Lowest average latency and tightest 99%/99.9% tails in almost every QD1 and QD32 test.
- Most resilient when a defrag job is active: retains ~60–70% of peak IOPS versus >40% for the other policies on SSD.
Weaknesses:
- Slightly behind `pid` on pure sequential multi-stream reads on HDD (jobs=4, defrag off), where simple affinity beats queue balancing.
2. `pid`
Strengths:
- Excels at multi-threaded sequential reads on HDD (jobs=4, QD16) by keeping each thread pinned to one device.
- Lowest average latency on sequential streams when you don't need to stripe.
Weaknesses:
- Suffers under random loads (QD1/QD32), where it can't dynamically shift to the least-busy device.
- Loses out to `queue` and `round-robin` on IOPS for random reads.
3. `round-robin`
Strengths:
- Maximises raw sequential striping bandwidth on SSD (tied with `queue` at QD32).
- Simple, predictable distribution with moderate latency characteristics.
Weaknesses:
- Higher tail latencies than `queue` under random workloads, since it may probe slower disks.
- Less adaptable when one mirror is busier or slower.
4. `latency`
Strengths:
- Good for latency-sensitive random reads when you want to always pick the currently fastest device.
- Better 99% tail than `round-robin` in many QD32 cases.
Weaknesses:
- Generates frequent device probes ("rehabs") under mixed or bursty loads, adding overhead.
- Higher average latency than `queue` when workloads are sustained.
5. `latency-rr`
Strengths:
- A middle ground between strict latency-aware selection and round-robin, giving reasonable throughput and lower average latency than pure `latency`.
Weaknesses:
- Still incurs extra seeks from rehabs, so 99.9% tails can be worse than `queue` or `round-robin`.
- Throughput falls behind `queue` under heavy random workloads.
Overall recommendation:
- Default to `queue` for most mixed or random workloads, and whenever background maintenance (e.g. defrag) is running.
- Switch to `pid` when you have highly parallel sequential streams on HDD and want minimal per-thread latency.
- Consider `latency` (or `latency-rr`) only if you need the absolute tightest 99% tail on random reads and can tolerate the extra overhead.
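For anyone wanting to reproduce a matrix like this on their own hardware, the sweep over policies can be scripted. This is only a sketch under the same assumptions as above (the `read_policy` sysfs knob with the patched policy names, placeholder paths, and only the random-read job shown; the sequential jobs and the background-defrag variant would be added the same way):

```bash
MNT=/mnt/btrfs                      # placeholder mount point
FSID=$(findmnt -no UUID "$MNT")     # filesystem UUID for the sysfs knob

for policy in pid round-robin latency latency-rr queue; do
    # Skip policies the running kernel does not accept.
    echo "$policy" | sudo tee /sys/fs/btrfs/"$FSID"/read_policy >/dev/null || continue
    for jobs in 1 4; do
        fio --name="randread-$policy-j$jobs" --directory="$MNT" --size=4G \
            --rw=randread --bs=4k --ioengine=libaio --direct=1 \
            --iodepth=32 --numjobs="$jobs" --group_reporting \
            --runtime=60 --time_based \
            --output-format=json --output="randread-$policy-j$jobs.json"
    done
done
```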