Date: 2026-03-20 to 2026-03-24
Update-verify tasks run ~6 minutes slower on AMD workers (b-linux-docker-amd) because those GCP machines (c3d-standard-16-lssd) only get 1 NVMe SSD, while Intel workers (c2-standard-16) get 2 NVMe SSDs in an LVM stripe, giving ~2x I/O throughput for the large-file copy pattern.
It's a provisioning asymmetry, not a CPU/Docker/image issue. AMD machines with 2 NVMe drives (c3d-standard-30-lssd) match Intel I/O performance exactly.
Update-verify tasks on b-linux-docker-amd (c3d-standard-16-lssd) take longer than on b-linux (c2-standard-16).
Source data points here.
Update-verify tasks run a cached_download.sh (link) loop that does grep + cp on cached files up to 120 MB in size, repeated up to 10,000 times per chunk. On AMD workers (b-linux-docker-amd, c3d-standard-16-lssd), this workload runs about 6 minutes slower than on Intel workers (b-linux, c2-standard-16): ~45 minutes instead of ~39.
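For orientation, the hot path looks roughly like this (a minimal sketch of the cached_download.sh pattern, not the verbatim script; variable handling and file naming are illustrative):

```bash
#!/bin/bash
# Sketch of the cached_download.sh hot path (illustrative, not verbatim).
url="$1"; output="$2"
# 1. Look up the URL in the plain-text cache index (cheap, CPU-bound).
line=$(grep -Fnx "$url" urls.list) || exit 1
n="${line%%:*}"
# 2. Copy the cached object (tens of MB) into the working dir (I/O-bound).
cp "obj_$(printf '%05d' "$n").cache" "$output"
```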
Hypothesis: The c3d-standard-16-lssd hardware (CPU/NVMe) is inherently slower than c2-standard-16 for this I/O pattern.
We spun up matching GCP VMs and ran fio + application-level benchmarks (grep + cp on 50 files, 1000 iterations) directly on the VMs.
Result: Both machines performed identically.
| Test | c2 (Intel) | c3d (AMD) | Ratio |
|---|---|---|---|
| App-level avg (PD-SSD) | 176 ms | 178 ms | 0.99x |
| App-level avg (NVMe) | 149 ms | 149 ms | 1.00x |
| fio rand read IOPS (NVMe) | 180,073 | 180,008 | 1.00x |
| fio rand write IOPS (NVMe) | 100,029 | 100,022 | 1.00x |
Verdict: Hardware ruled out. Both machine types have identical NVMe hardware (same vendor ID 0x1ae0, same IOPS caps).
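The fio numbers line up with the per-device IOPS caps on GCP local SSDs. A representative invocation for the random read/write tests (parameters are my reconstruction, not the exact job file used):

```bash
# Random 4k read and write against the raw local SSD device.
# NOTE: the randwrite job is destructive to the device's contents.
fio --name=randread --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
    --rw=randread --bs=4k --iodepth=256 --numjobs=4 \
    --runtime=60 --time_based --group_reporting
fio --name=randwrite --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
    --rw=randwrite --bs=4k --iodepth=256 --numjobs=4 \
    --runtime=60 --time_based --group_reporting
```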
Hypothesis: Docker overlay2 or the Firefox CI image introduces overhead that affects c3d differently.
We loaded the actual update-verify:latest Docker image from Taskcluster
artifacts and tested five configurations: bare VM (PD-SSD and NVMe), Docker
overlay2, Docker with bind-mounted PD-SSD, Docker with bind-mounted NVMe.
Result: All five configurations within 3%. Docker adds <2ms overhead, same on both machines.
| Configuration | c2 (Intel) | c3d (AMD) | Ratio |
|---|---|---|---|
| Bare VM PD-SSD | 171 ms | 171 ms | 1.00x |
| Bare VM NVMe | 145 ms | 150 ms | 0.97x |
| Docker overlay2 | 169 ms | 171 ms | 0.99x |
| Docker + bind PD-SSD | 170 ms | 173 ms | 0.98x |
| Docker + bind NVMe | 150 ms | 147 ms | 1.02x |
Verdict: Docker/overlay2 and CI image ruled out.
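The Docker configurations differ only in where the benchmark directory lives. A sketch of the two interesting cases (host paths are placeholders, and python3 being present in the image is an assumption):

```bash
# overlay2: the benchmark directory sits on the container's writable layer.
docker run --rm -v "$PWD/io-benchmark.py:/io-benchmark.py:ro" \
    update-verify:latest python3 -u /io-benchmark.py
# bind-mounted NVMe: same image, but the directory is a host dir on local SSD.
docker run --rm -v "$PWD/io-benchmark.py:/io-benchmark.py:ro" \
    -v /mnt/disks/nvme0/workspace:/builds/worker/workspace \
    update-verify:latest python3 -u /io-benchmark.py
```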
Hypothesis: The Taskcluster workers use different Persistent Disk types (pd-standard vs pd-balanced) which could explain the gap.
We checked with gcloud compute disks list and found AMD workers use
pd-balanced while Intel workers use pd-standard. We re-ran VM benchmarks with
matching disk types.
Result: pd-balanced is actually faster than pd-standard (3x more write IOPS), which would make AMD faster, not slower. In any case, it turned out to be entirely irrelevant:
Verdict: Boot disk type is irrelevant -- task workloads don't use the boot disk.
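The disk-type check itself is one command (the filter value is a placeholder for the actual instance names):

```bash
# List boot disks and their Persistent Disk types for the worker instances.
gcloud compute disks list \
    --filter="name~'^b-linux'" \
    --format="table(name,type,sizeGb,zone)"
```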
Hypothesis: Something in Taskcluster's worker provisioning is different.
We added disk/storage diagnostics (lsblk, findmnt, mount, cgroup checks)
to the benchmark script and ran it on actual Taskcluster workers via try push.
This revealed the root cause.
Storage topology on live workers:
| | Intel (b-linux) | AMD-16 (b-linux-docker-amd) |
|---|---|---|
| Local NVMe drives | 2x 375G (nvme0n1, nvme0n2) | 1x 375G (nvme1n1) |
| LVM volume | 738G, stripe=32 | 369G, no stripe |
| Filesystem | ext4, nobarrier, data=writeback | ext4, nobarrier, data=writeback |
| cgroup I/O limits | None | None |
| cgroup memory limits | None | None |
Intel workers have 2 NVMe SSDs in an LVM stripe giving ~2x sequential throughput. AMD workers have 1 NVMe SSD with no striping.
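The decisive diagnostics are standard tools and can be re-run on any worker:

```bash
# How many local NVMe devices does this worker have?
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
# Is the LVM volume striped across them? (#Str column: 2 on Intel, 1 on AMD-16)
sudo lvs --segments
# What actually backs the task workspace, and with which mount options?
findmnt /builds/worker/workspace
```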
First Taskcluster benchmark results (tasks Qa_pc1spSEuJzg9q-92KBg and QgQJlSSUR_SkV7NE7qVaAg):
| Metric | Intel (2x NVMe) | AMD-16 (1x NVMe) | Ratio |
|---|---|---|---|
| Avg | 76 ms | 146 ms | 1.92x |
The 1.92x ratio matches the 2:1 NVMe drive count almost perfectly.
Hypothesis: If the root cause is 1 vs 2 NVMe drives, then an AMD machine with 2 NVMe drives should match Intel.
The c3d-standard-16-lssd only supports 1 local SSD (fixed by GCP, cannot be
changed). The c3d-standard-30-lssd (used by b-linux-docker-large-amd) gets
2 local SSDs. We added an amd-30 benchmark task and ran all three.
Tasks: O0wOvbnEQF6h8ptHBA6jbg (Intel), Yx4AjHXCRfyptBUIPTPtEg (AMD-30), WTIaalknRCmmcyO78GGIAg (AMD-16).
Result:
| | Intel (b-linux) | AMD-30 (b-linux-docker-large-amd) | AMD-16 (b-linux-docker-amd) |
|---|---|---|---|
| Machine | c2-standard-16 | c3d-standard-30-lssd | c3d-standard-16-lssd |
| NVMe drives | 2x 375G | 2x 375G | 1x 375G |
| LVM stripe | stripe=32 | stripe=32 | no stripe |
| Avg | 76 ms | 77 ms | 146 ms |
| P50 | 80 ms | 82 ms | 157 ms |
| P95 | 109 ms | 115 ms | 205 ms |
| P99 | 118 ms | 149 ms | 237 ms |
| Total (2000 ops) | 151.9s | 154.8s | 291.4s |
| Ratio vs Intel | 1.00x | 1.01x | 1.92x |
AMD-30 with 2 NVMe drives matches Intel exactly (1.01x). The AMD-16 with 1 NVMe remains at 1.92x.
| # | Hypothesis | How Tested | Result |
|---|---|---|---|
| 1 | AMD CPU/NVMe hardware is slower | fio + app benchmarks on matching VMs | Identical performance |
| 2 | Docker overlay2 adds overhead on c3d | 5 storage configs with Firefox CI image | <3% difference |
| 3 | Firefox CI Docker image is the issue | Used exact update-verify:latest image | No difference |
| 4 | PD boot disk type (pd-standard vs pd-balanced) | Tested both, checked actual disk types | Irrelevant -- tasks use local NVMe |
| 5 | cgroup I/O throttling | Checked io.max, blkio.throttle.* on live workers | No limits set |
| 6 | cgroup memory limits | Checked memory.max on live workers | No limits set |
| 7 | Docker daemon configuration differences | Inspected mount options on live workers | Identical ext4 options |
| 8 | fork/exec overhead | Measured subprocess spawn latency | ~30 µs difference (negligible) |
The c3d-standard-16-lssd GCP machine type includes only 1 local NVMe SSD (375 GB). The c2-standard-16 is provisioned with 2 local NVMe SSDs (2x 375 GB) configured in an LVM stripe (stripe=32), giving it ~2x sequential I/O throughput.
The update-verify workload is dominated by sequential large-file copies (60-80 MB), which is exactly the pattern that benefits from LVM striping across multiple drives.
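A quick way to see why: with a 2-wide stripe, consecutive extents of a large sequential copy land alternately on both SSDs, so the two drives stream in parallel; a single drive serializes the whole copy. A rough on-worker check (illustrative; writes a scratch file):

```bash
# Sequential write of 1 GiB through the filesystem, bypassing the page cache.
# Expect roughly 2x the single-drive throughput on the striped (Intel) workers.
dd if=/dev/zero of=/builds/worker/workspace/seqtest bs=1M count=1024 oflag=direct
rm /builds/worker/workspace/seqtest
```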
This is not a bug in Taskcluster, Docker, or the CI image. It is a
provisioning asymmetry: Intel workers get 2 local SSDs while AMD-16 workers
get 1, because the -lssd machine type variants have a fixed number of local
SSDs determined by vCPU count:
- c3d-standard-16-lssd: 1x 375G local SSD
- c3d-standard-30-lssd: 2x 375G local SSD
- c3d-standard-44-lssd: 4x 375G local SSD
https://docs.cloud.google.com/compute/docs/general-purpose-machines#c3d_machine_types
The c2 family allows attaching local SSDs independently of vCPU count, so c2-standard-16 can have 2 (or more) local SSDs attached.
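For comparison, on c2 the local SSD count is chosen at instance creation time (sketch; instance name and zone are placeholders), whereas the c3d -lssd variants bundle a fixed number of SSDs with the machine type:

```bash
# Sketch: two local NVMe SSDs attached explicitly to a c2 instance.
gcloud compute instances create bench-c2 \
    --machine-type=c2-standard-16 \
    --zone=us-central1-b \
    --local-ssd=interface=NVME \
    --local-ssd=interface=NVME
```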
Note: this is the final form of the script used to benchmark I/O in tasks.
```python
#!/usr/bin/env python3
"""
I/O benchmark that simulates the cache-retrieve pattern from update-verify.

The update-verify task caches downloaded MARs and tarballs on disk, then
retrieves them via cached_download.sh which does:

  1. grep -Fnx "$url" urls.list  (lookup URL in text index)
  2. cp obj_XXXXX.cache output   (copy 60-80MB file to working dir)

This is repeated ~9,784 times per chunk. On c3d-standard-16-lssd workers
(b-linux-docker-amd), this cp through Docker overlay2 shows 1.69x worse
tail latency than on c2-standard-16 (b-linux), adding ~14 minutes.

This benchmark reproduces that exact pattern to measure per-worker I/O.
"""
import os
import random
import shutil
import statistics
import subprocess
import sys
import time

BASE_DIR = "/builds/worker/workspace"
CACHE_DIR = os.path.join(BASE_DIR, "uv-cache")
WORK_DIR = os.path.join(BASE_DIR, "uv-work")

NUM_CACHE_FILES = 100
# 40 x 80MB (MARs) + 40 x 60MB (tarballs) + 20 x 5MB (small artifacts)
SIZES_MB = [80] * 40 + [60] * 40 + [5] * 20
NUM_RETRIEVES = 2000
MB = 1024 * 1024


def print_system_info():
    print("=== System info ===")
    os.system("uname -a")
    os.system('grep "model name" /proc/cpuinfo | head -1')
    os.system("free -h | head -2")
    os.system(f"df -Th {BASE_DIR}")
    os.system(f"findmnt {BASE_DIR} 2>/dev/null || true")
    os.system("lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,ROTA,MODEL 2>/dev/null || true")
    os.system("cat /sys/block/*/queue/scheduler 2>/dev/null || true")
    os.system("mount | grep -E 'workspace|builds|overlay' || true")
    print()
    print("=== cgroup I/O limits ===")
    os.system("cat /sys/fs/cgroup/io.max 2>/dev/null || true")
    os.system("cat /sys/fs/cgroup/io.latency 2>/dev/null || true")
    os.system(
        "cat /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device 2>/dev/null || true"
    )
    os.system(
        "cat /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device 2>/dev/null || true"
    )
    print()
    print("=== cgroup memory limits ===")
    os.system("cat /sys/fs/cgroup/memory.max 2>/dev/null || true")
    os.system("cat /sys/fs/cgroup/memory.current 2>/dev/null || true")
    print()


def populate_cache():
    print("=== Phase 1: Populate cache (simulates async_download.py) ===")
    os.makedirs(CACHE_DIR, exist_ok=True)
    os.makedirs(WORK_DIR, exist_ok=True)
    urls = []
    for i in range(NUM_CACHE_FILES):
        sz = SIZES_MB[i] * MB
        fname = os.path.join(CACHE_DIR, f"obj_{i+1:05d}.cache")
        url = (
            f"https://archive.mozilla.org/pub/firefox/fake/"
            f"{i+1}/firefox-{SIZES_MB[i]}MB.mar"
        )
        urls.append(url)
        with open(fname, "wb") as f:
            written = 0
            # Reuse one random chunk to avoid spending all time in urandom
            chunk = os.urandom(min(MB, sz))
            while written < sz:
                f.write(chunk)
                written += len(chunk)
        sys.stdout.write(f"\r Created {i+1}/{NUM_CACHE_FILES} ({SIZES_MB[i]}MB)")
        sys.stdout.flush()
    print()
    urls_path = os.path.join(CACHE_DIR, "urls.list")
    with open(urls_path, "w") as f:
        f.writelines(u + "\n" for u in urls)
    print(f" urls.list: {len(urls)} entries")
    total_gb = sum(SIZES_MB) / 1024
    print(f" Total cache: {total_gb:.1f} GB")
    os.system("sync")
    os.system("echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true")
    print()
    return urls


def run_benchmark(urls):
    urls_path = os.path.join(CACHE_DIR, "urls.list")
    print("=== Phase 2: Cache retrieves (simulates cached_download.sh) ===")
    print(f" {NUM_RETRIEVES} retrieves: grep urls.list + cp obj_XXXXX.cache")
    print()
    timings = []
    size_timings = {80: [], 60: [], 5: []}
    for r in range(NUM_RETRIEVES):
        idx = random.randint(0, NUM_CACHE_FILES - 1)
        url = urls[idx]
        sz = SIZES_MB[idx]
        cache_file = os.path.join(CACHE_DIR, f"obj_{idx+1:05d}.cache")
        out_file = os.path.join(WORK_DIR, "output.dat")
        t0 = time.monotonic()
        subprocess.run(["grep", "-Fnx", url, urls_path], check=False, capture_output=True)
        shutil.copy2(cache_file, out_file)
        t1 = time.monotonic()
        elapsed_ms = (t1 - t0) * 1000
        timings.append(elapsed_ms)
        size_timings[sz].append(elapsed_ms)
        if (r + 1) % 200 == 0:
            avg = statistics.mean(timings)
            print(f" [{r+1:4d}/{NUM_RETRIEVES}] last={elapsed_ms:.0f}ms avg={avg:.0f}ms")
    return timings, size_timings


def report(label, data):
    if not data:
        return
    data.sort()
    n = len(data)
    total = sum(data)
    avg = total / n
    p50 = data[n * 50 // 100]
    p95 = data[n * 95 // 100]
    p99 = data[n * 99 // 100]
    print()
    print(f"--- {label} ---")
    print(f" Count: {n}")
    print(f" Total: {total:.0f}ms ({total/1000:.1f}s)")
    print(f" Avg: {avg:.0f}ms")
    print(f" Min: {data[0]:.0f}ms")
    print(f" Max: {data[-1]:.0f}ms")
    print(f" P50: {p50:.0f}ms")
    print(f" P95: {p95:.0f}ms")
    print(f" P99: {p99:.0f}ms")
    print()
    print(" Histogram:")
    buckets = {}
    for v in data:
        b = int(v // 100) * 100
        b = min(b, 2000)
        buckets[b] = buckets.get(b, 0) + 1
    for b in sorted(buckets):
        bar = "#" * min(buckets[b], 60)
        hi = f"{b+99}" if b < 2000 else "+"
        print(f" {b:5d}-{hi:>5s}ms: {bar} ({buckets[b]})")


def main():
    print_system_info()
    urls = populate_cache()
    timings, size_timings = run_benchmark(urls)
    print()
    print("=========================================")
    print("=== RESULTS ===")
    print("=========================================")
    report("All retrieves", timings)
    for sz in [80, 60, 5]:
        report(f"{sz}MB files", size_timings[sz])
    print()
    print("=== Benchmark complete ===")


if __name__ == "__main__":
    main()
```

The taskgraph kind that schedules the benchmark on the three worker types:

```yaml
---
loader: taskgraph.loader.transform:loader

transforms:
    - gecko_taskgraph.transforms.io_benchmark:transforms
    - gecko_taskgraph.transforms.job:transforms
    - gecko_taskgraph.transforms.task:transforms

task-defaults:
    worker:
        docker-image:
            in-tree: "update-verify"
        max-run-time: 3600
    run:
        using: run-task
        checkout: false
        command: # To be filled by transform
    treeherder:
        kind: test
        platform: linux64/opt
        tier: 3

tasks:
    amd-16:
        description: I/O cache-retrieve benchmark on b-linux-docker-amd
        worker-type: b-linux-docker-amd
        treeherder:
            symbol: IOB(amd16)
    amd-30:
        description: I/O cache-retrieve benchmark on b-linux-docker-large-amd
        worker-type: b-linux-docker-large-amd
        treeherder:
            symbol: IOB(amd30)
    intel:
        description: I/O cache-retrieve benchmark on b-linux
        worker-type: b-linux
        treeherder:
            symbol: IOB(intel)
```

And the transform that inlines the benchmark script into each task's command:

```python
import os
from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()

SCRIPT_PATH = os.path.join(
    os.path.dirname(__file__), "..", "..", "scripts", "misc", "io-benchmark.py"
)


@transforms.add
def add_command(config, jobs):
    with open(SCRIPT_PATH) as f:
        script = f.read()
    for job in jobs:
        job["run"]["command"] = (
            "cat > /tmp/io-benchmark.py << 'BENCH_EOF'\n"
            f"{script}"
            "BENCH_EOF\n"
            "python3 -u /tmp/io-benchmark.py"
        )
        yield job
```
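To schedule the three tasks, a try push along these lines should work (the fuzzy query is illustrative, and --full is my assumption, to include tasks outside the default target set):

```bash
# Hypothetical invocation: select the IOB benchmark tasks on try.
./mach try fuzzy --full -q "'io-benchmark"
```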