
Update-Verify I/O Performance Investigation

Date: 2026-03-20 to 2026-03-24

TL;DR

Update-verify tasks run ~6 minutes slower on AMD workers (b-linux-docker-amd) because those GCP machines (c3d-standard-16-lssd) only get 1 NVMe SSD, while Intel workers (c2-standard-16) get 2 NVMe SSDs in an LVM stripe, giving ~2x I/O throughput for the large-file copy pattern.

It's a provisioning asymmetry, not a CPU/Docker/image issue. AMD machines with 2 NVMe drives (c3d-standard-30-lssd) match Intel I/O performance exactly.


Context

Update-verify tasks on b-linux-docker-amd (c3d-standard-16-lssd) take longer than on b-linux (c2-standard-16). Source data points here.

Update-verify tasks run a cached_download.sh (link) loop that does grep + cp on cached files up to 120 MB in size, repeated up to 10,000 times per chunk.
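The hot path boils down to an index lookup plus a large copy. A minimal sketch (simplified; the real script linked above also handles cache misses and bookkeeping, and the variable and path names here are illustrative):

```bash
#!/bin/bash
# Sketch of the cached_download.sh retrieve path.
url="$1"       # URL being "downloaded"
output="$2"    # destination in the task's working directory

# 1. Look the URL up in the plain-text cache index; the matching line
#    number identifies the cached object.
line=$(grep -Fnx "$url" cache/urls.list | cut -d: -f1)

# 2. Copy the cached object (often tens of MB) into the working dir.
cp "cache/obj_$(printf '%05d' "$line").cache" "$output"
```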

On AMD workers (b-linux-docker-amd, c3d-standard-16-lssd), this workload runs about 6 minutes slower than on Intel workers (b-linux, c2-standard-16): total duration goes from ~39 to ~45 minutes.


Investigation Timeline

Phase 1: GCP VM Benchmarks (bare VMs)

Hypothesis: The c3d-standard-16-lssd hardware (CPU/NVMe) is inherently slower than c2-standard-16 for this I/O pattern.

We spun up matching GCP VMs and ran fio + application-level benchmarks (grep + cp on 50 files, 1000 iterations) directly on the VMs.
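The exact fio invocations aren't reproduced here; a representative pair, assuming 4k random I/O against a file on the mounted NVMe volume (paths illustrative):

```bash
# Random-read IOPS against the local NVMe mount.
fio --name=randread --directory=/mnt/nvme --size=4G --rw=randread \
    --bs=4k --ioengine=libaio --iodepth=64 --direct=1 \
    --runtime=60 --time_based --group_reporting

# Same shape for random writes.
fio --name=randwrite --directory=/mnt/nvme --size=4G --rw=randwrite \
    --bs=4k --ioengine=libaio --iodepth=64 --direct=1 \
    --runtime=60 --time_based --group_reporting
```

The near-identical numbers in the table below reflect both machines hitting the same IOPS caps.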

Result: Both machines performed identically.

| Test | c2 (Intel) | c3d (AMD) | Ratio |
| --- | --- | --- | --- |
| App-level avg (PD-SSD) | 176 ms | 178 ms | 0.99x |
| App-level avg (NVMe) | 149 ms | 149 ms | 1.00x |
| fio rand read IOPS (NVMe) | 180,073 | 180,008 | 1.00x |
| fio rand write IOPS (NVMe) | 100,029 | 100,022 | 1.00x |

Verdict: Hardware ruled out. Both machine types have identical NVMe hardware (same vendor ID 0x1ae0, same IOPS caps).

Phase 2: Docker + Firefox CI Image

Hypothesis: Docker overlay2 or the Firefox CI image introduces overhead that affects c3d differently.

We loaded the actual update-verify:latest Docker image from Taskcluster artifacts and tested five configurations: bare VM on PD-SSD, bare VM on NVMe, Docker overlay2, Docker with a bind-mounted PD-SSD, and Docker with a bind-mounted NVMe.
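Roughly how the three Docker configurations were exercised (a sketch: the image tag matches the task image, but the benchmark path and mount points are illustrative):

```bash
# Default overlay2 storage driver.
docker run --rm update-verify:latest python3 /tmp/bench.py

# Bind-mount a PD-SSD-backed directory over the workspace.
docker run --rm -v /mnt/pd-ssd/work:/builds/worker/workspace \
    update-verify:latest python3 /tmp/bench.py

# Bind-mount the local NVMe volume over the workspace.
docker run --rm -v /mnt/nvme/work:/builds/worker/workspace \
    update-verify:latest python3 /tmp/bench.py
```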

Result: all five configurations were within 3% of each other. Docker adds <2 ms of overhead, the same on both machines.

| Configuration | c2 (Intel) | c3d (AMD) | Ratio |
| --- | --- | --- | --- |
| Bare VM PD-SSD | 171 ms | 171 ms | 1.00x |
| Bare VM NVMe | 145 ms | 150 ms | 0.97x |
| Docker overlay2 | 169 ms | 171 ms | 0.99x |
| Docker + bind PD-SSD | 170 ms | 173 ms | 0.98x |
| Docker + bind NVMe | 150 ms | 147 ms | 1.02x |

Verdict: Docker/overlay2 and CI image ruled out.

Phase 3: PD Disk Type Investigation

Hypothesis: The Taskcluster workers use different Persistent Disk types (pd-standard vs pd-balanced) which could explain the gap.

We checked with gcloud compute disks list and found AMD workers use pd-balanced while Intel workers use pd-standard. We re-ran VM benchmarks with matching disk types.
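A representative query (the filter expression is illustrative):

```bash
gcloud compute disks list \
    --format="table(name,type,sizeGb,zone)" \
    --filter="name ~ 'b-linux'"
```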

Result: pd-balanced is actually faster than pd-standard (3x more write IOPS), which would make AMD faster, not slower. In any case, the boot disk type turned out to be entirely irrelevant:

Verdict: Boot disk type is irrelevant -- task workloads don't use the boot disk.

Phase 4: Inspecting Live Worker Hardware

Hypothesis: Something in Taskcluster's worker provisioning is different.

We added disk/storage diagnostics (lsblk, findmnt, mount, cgroup checks) to the benchmark script and ran it on actual Taskcluster workers via try push.
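Representative probes (the full set is embedded in the benchmark script in the appendix; the lvs/dmsetup lines are an assumed way to read the stripe layout and may need root):

```bash
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT    # drive count and layout
findmnt /builds/worker/workspace             # what backs the workspace
lvs -o lv_name,lv_size,stripes,stripesize    # LVM stripe count and size
dmsetup table                                # "striped 2 ..." = 2-way stripe
cat /sys/fs/cgroup/io.max /sys/fs/cgroup/memory.max 2>/dev/null  # cgroup v2 limits
```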

This revealed the root cause.

Storage topology on live workers:

| | Intel (b-linux) | AMD-16 (b-linux-docker-amd) |
| --- | --- | --- |
| Local NVMe drives | 2x 375G (nvme0n1, nvme0n2) | 1x 375G (nvme1n1) |
| LVM volume | 738G, stripe=32 | 369G, no stripe |
| Filesystem | ext4, nobarrier, data=writeback | ext4, nobarrier, data=writeback |
| cgroup I/O limits | None | None |
| cgroup memory limits | None | None |

Intel workers have 2 NVMe SSDs in an LVM stripe giving ~2x sequential throughput. AMD workers have 1 NVMe SSD with no striping.
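For reference, this is roughly how a 2-way stripe like the Intel workers' would be assembled from two local SSDs (illustrative, not the actual provisioning code: device paths and the volume group name are guesses, and "-I 32" assumes the reported stripe=32 is a 32 KiB LVM stripe size):

```bash
pvcreate /dev/nvme0n1 /dev/nvme0n2
vgcreate instance_storage /dev/nvme0n1 /dev/nvme0n2
# Two stripes (-i 2) interleave extents across both drives.
lvcreate --type striped -i 2 -I 32 -l 100%FREE -n data instance_storage
mkfs.ext4 /dev/instance_storage/data
# Mount options mirror those reported on the live workers.
mount -o noatime,nobarrier,data=writeback /dev/instance_storage/data \
    /builds/worker
```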

First Taskcluster benchmark results (tasks Qa_pc1spSEuJzg9q-92KBg and QgQJlSSUR_SkV7NE7qVaAg):

| Metric | Intel (2x NVMe) | AMD-16 (1x NVMe) | Ratio |
| --- | --- | --- | --- |
| Avg | 76 ms | 146 ms | 1.92x |

The 1.92x ratio matches the 2:1 NVMe drive count almost perfectly.

Phase 5: Confirmation with c3d-standard-30-lssd (2 NVMe)

Hypothesis: If the root cause is 1 vs 2 NVMe drives, then an AMD machine with 2 NVMe drives should match Intel.

The c3d-standard-16-lssd only supports 1 local SSD (fixed by GCP, cannot be changed). The c3d-standard-30-lssd (used by b-linux-docker-large-amd) gets 2 local SSDs. We added an amd-30 benchmark task and ran all three.

Tasks: O0wOvbnEQF6h8ptHBA6jbg (Intel), Yx4AjHXCRfyptBUIPTPtEg (AMD-30), WTIaalknRCmmcyO78GGIAg (AMD-16).

Result:

| | Intel (b-linux) | AMD-30 (b-linux-docker-large-amd) | AMD-16 (b-linux-docker-amd) |
| --- | --- | --- | --- |
| Machine | c2-standard-16 | c3d-standard-30-lssd | c3d-standard-16-lssd |
| NVMe drives | 2x 375G | 2x 375G | 1x 375G |
| LVM stripe | stripe=32 | stripe=32 | no stripe |
| Avg | 76 ms | 77 ms | 146 ms |
| P50 | 80 ms | 82 ms | 157 ms |
| P95 | 109 ms | 115 ms | 205 ms |
| P99 | 118 ms | 149 ms | 237 ms |
| Total (2000 ops) | 151.9 s | 154.8 s | 291.4 s |
| Ratio vs Intel | 1.00x | 1.01x | 1.92x |

AMD-30 with 2 NVMe drives matches Intel (1.01x); AMD-16 with 1 NVMe remains at 1.92x.


Hypotheses Tested and Eliminated

| # | Hypothesis | How Tested | Result |
| --- | --- | --- | --- |
| 1 | AMD CPU/NVMe hardware is slower | fio + app benchmarks on matching VMs | Identical performance |
| 2 | Docker overlay2 adds overhead on c3d | 5 storage configs with Firefox CI image | <3% difference |
| 3 | Firefox CI Docker image is the issue | Used exact update-verify:latest | No difference |
| 4 | PD boot disk type (pd-standard vs pd-balanced) | Tested both, checked actual disk types | Irrelevant -- tasks use local NVMe |
| 5 | cgroup I/O throttling | Checked io.max, blkio.throttle.* on live workers | No limits set |
| 6 | cgroup memory limits | Checked memory.max on live workers | No limits set |
| 7 | Docker daemon configuration differences | Inspected mount options on live workers | Identical ext4 options |
| 8 | fork/exec overhead | Measured subprocess spawn latency | 30 µs difference (negligible) |

Root Cause

The c3d-standard-16-lssd GCP machine type includes only 1 local NVMe SSD (375 GB). The c2-standard-16 is provisioned with 2 local NVMe SSDs (2x 375 GB) configured in an LVM stripe (stripe=32), giving it ~2x sequential I/O throughput.

The update-verify workload is dominated by sequential large-file copies (60-80 MB), which is exactly the pattern that benefits from LVM striping across multiple drives.
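A quick way to see the striping win on this pattern (illustrative; 1M sequential reads approximate what cp generates):

```bash
fio --name=seqread --directory=/builds/worker/workspace --size=4G \
    --rw=read --bs=1M --ioengine=libaio --iodepth=8 --direct=1 \
    --runtime=30 --time_based --group_reporting
```

On the striped 2-drive volume this should report roughly twice the bandwidth of the single-drive volume.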

This is not a bug in Taskcluster, Docker, or the CI image. It is a provisioning asymmetry: Intel workers get 2 local SSDs while AMD-16 workers get 1, because the -lssd machine type variants have a fixed number of local SSDs determined by vCPU count:

  • c3d-standard-16-lssd: 1x 375G local SSD
  • c3d-standard-30-lssd: 2x 375G local SSD
  • c3d-standard-44-lssd: 4x 375G local SSD

https://docs.cloud.google.com/compute/docs/general-purpose-machines#c3d_machine_types

The c2 family allows attaching local SSDs independently of vCPU count, so c2-standard-16 can have 2 (or more) local SSDs attached.
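For example, a c2 instance can be created with any number of local SSDs by repeating the flag (instance name and zone are illustrative), whereas the c3d -lssd variants come with the fixed counts listed above:

```bash
gcloud compute instances create io-test-c2 \
    --zone=us-central1-a \
    --machine-type=c2-standard-16 \
    --local-ssd=interface=NVME \
    --local-ssd=interface=NVME
```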


Appendix: Test Infrastructure

Benchmark Script

Note: this is the final form of the script used to benchmark I/O in tasks.

#!/usr/bin/env python3

"""
I/O benchmark that simulates the cache-retrieve pattern from update-verify.

The update-verify task caches downloaded MARs and tarballs on disk, then
retrieves them via cached_download.sh which does:
  1. grep -Fnx "$url" urls.list   (lookup URL in text index)
  2. cp obj_XXXXX.cache output    (copy 60-80MB file to working dir)

This is repeated ~9,784 times per chunk. On c3d-standard-16-lssd workers
(b-linux-docker-amd), this cp through Docker overlay2 shows 1.69x worse
tail latency than on c2-standard-16 (b-linux), adding ~14 minutes.

This benchmark reproduces that exact pattern to measure per-worker I/O.
"""

import os
import random
import shutil
import statistics
import subprocess
import sys
import time

BASE_DIR = "/builds/worker/workspace"
CACHE_DIR = os.path.join(BASE_DIR, "uv-cache")
WORK_DIR = os.path.join(BASE_DIR, "uv-work")
NUM_CACHE_FILES = 100
# 40 x 80MB (MARs) + 40 x 60MB (tarballs) + 20 x 5MB (small artifacts)
SIZES_MB = [80] * 40 + [60] * 40 + [5] * 20
NUM_RETRIEVES = 2000
MB = 1024 * 1024


def print_system_info():
    print("=== System info ===")
    os.system("uname -a")
    os.system('grep "model name" /proc/cpuinfo | head -1')
    os.system("free -h | head -2")
    os.system(f"df -Th {BASE_DIR}")
    os.system(f"findmnt {BASE_DIR} 2>/dev/null || true")
    os.system("lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,ROTA,MODEL 2>/dev/null || true")
    os.system("cat /sys/block/*/queue/scheduler 2>/dev/null || true")
    os.system("mount | grep -E 'workspace|builds|overlay' || true")
    print()

    print("=== cgroup I/O limits ===")
    os.system("cat /sys/fs/cgroup/io.max 2>/dev/null || true")
    os.system("cat /sys/fs/cgroup/io.latency 2>/dev/null || true")
    os.system(
        "cat /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device 2>/dev/null || true"
    )
    os.system(
        "cat /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device 2>/dev/null || true"
    )
    print()

    print("=== cgroup memory limits ===")
    os.system("cat /sys/fs/cgroup/memory.max 2>/dev/null || true")
    os.system("cat /sys/fs/cgroup/memory.current 2>/dev/null || true")
    print()


def populate_cache():
    print("=== Phase 1: Populate cache (simulates async_download.py) ===")

    os.makedirs(CACHE_DIR, exist_ok=True)
    os.makedirs(WORK_DIR, exist_ok=True)

    urls = []
    for i in range(NUM_CACHE_FILES):
        sz = SIZES_MB[i] * MB
        fname = os.path.join(CACHE_DIR, f"obj_{i+1:05d}.cache")
        url = (
            f"https://archive.mozilla.org/pub/firefox/fake/"
            f"{i+1}/firefox-{SIZES_MB[i]}MB.mar"
        )
        urls.append(url)
        with open(fname, "wb") as f:
            written = 0
            # Reuse one random chunk to avoid spending all time in urandom
            chunk = os.urandom(min(MB, sz))
            while written < sz:
                f.write(chunk)
                written += len(chunk)
        sys.stdout.write(f"\r  Created {i+1}/{NUM_CACHE_FILES} ({SIZES_MB[i]}MB)")
        sys.stdout.flush()
    print()

    urls_path = os.path.join(CACHE_DIR, "urls.list")
    with open(urls_path, "w") as f:
        f.writelines(u + "\n" for u in urls)
    print(f"  urls.list: {len(urls)} entries")

    total_gb = sum(SIZES_MB) / 1024
    print(f"  Total cache: {total_gb:.1f} GB")

    os.system("sync")
    os.system("echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true")
    print()
    return urls


def run_benchmark(urls):
    urls_path = os.path.join(CACHE_DIR, "urls.list")

    print("=== Phase 2: Cache retrieves (simulates cached_download.sh) ===")
    print(f"  {NUM_RETRIEVES} retrieves: grep urls.list + cp obj_XXXXX.cache")
    print()

    timings = []
    size_timings = {80: [], 60: [], 5: []}

    for r in range(NUM_RETRIEVES):
        idx = random.randint(0, NUM_CACHE_FILES - 1)
        url = urls[idx]
        sz = SIZES_MB[idx]
        cache_file = os.path.join(CACHE_DIR, f"obj_{idx+1:05d}.cache")
        out_file = os.path.join(WORK_DIR, "output.dat")

        t0 = time.monotonic()
        subprocess.run(["grep", "-Fnx", url, urls_path], check=False, capture_output=True)
        shutil.copy2(cache_file, out_file)
        t1 = time.monotonic()

        elapsed_ms = (t1 - t0) * 1000
        timings.append(elapsed_ms)
        size_timings[sz].append(elapsed_ms)

        if (r + 1) % 200 == 0:
            avg = statistics.mean(timings)
            print(f"  [{r+1:4d}/{NUM_RETRIEVES}] last={elapsed_ms:.0f}ms avg={avg:.0f}ms")

    return timings, size_timings


def report(label, data):
    if not data:
        return
    data.sort()
    n = len(data)
    total = sum(data)
    avg = total / n
    p50 = data[n * 50 // 100]
    p95 = data[n * 95 // 100]
    p99 = data[n * 99 // 100]

    print()
    print(f"--- {label} ---")
    print(f"  Count:  {n}")
    print(f"  Total:  {total:.0f}ms ({total/1000:.1f}s)")
    print(f"  Avg:    {avg:.0f}ms")
    print(f"  Min:    {data[0]:.0f}ms")
    print(f"  Max:    {data[-1]:.0f}ms")
    print(f"  P50:    {p50:.0f}ms")
    print(f"  P95:    {p95:.0f}ms")
    print(f"  P99:    {p99:.0f}ms")

    print()
    print("  Histogram:")
    buckets = {}
    for v in data:
        b = int(v // 100) * 100
        b = min(b, 2000)
        buckets[b] = buckets.get(b, 0) + 1
    for b in sorted(buckets):
        bar = "#" * min(buckets[b], 60)
        hi = f"{b+99}" if b < 2000 else "+"
        print(f"    {b:5d}-{hi:>5s}ms: {bar} ({buckets[b]})")


def main():
    print_system_info()
    urls = populate_cache()
    timings, size_timings = run_benchmark(urls)

    print()
    print("=========================================")
    print("=== RESULTS ===")
    print("=========================================")

    report("All retrieves", timings)
    for sz in [80, 60, 5]:
        report(f"{sz}MB files", size_timings[sz])

    print()
    print("=== Benchmark complete ===")


if __name__ == "__main__":
    main()

Kind:

---
loader: taskgraph.loader.transform:loader

transforms:
    - gecko_taskgraph.transforms.io_benchmark:transforms
    - gecko_taskgraph.transforms.job:transforms
    - gecko_taskgraph.transforms.task:transforms

task-defaults:
    worker:
        docker-image:
            in-tree: "update-verify"
        max-run-time: 3600
    run:
        using: run-task
        checkout: false
        command:  # To be filled by transform
    treeherder:
        kind: test
        platform: linux64/opt
        tier: 3

tasks:
    amd-16:
        description: I/O cache-retrieve benchmark on b-linux-docker-amd
        worker-type: b-linux-docker-amd
        treeherder:
            symbol: IOB(amd16)

    amd-30:
        description: I/O cache-retrieve benchmark on b-linux-docker-large-amd
        worker-type: b-linux-docker-large-amd
        treeherder:
            symbol: IOB(amd30)

    intel:
        description: I/O cache-retrieve benchmark on b-linux
        worker-type: b-linux
        treeherder:
            symbol: IOB(intel)

Transform:

import os

from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()

SCRIPT_PATH = os.path.join(
    os.path.dirname(__file__), "..", "..", "scripts", "misc", "io-benchmark.py"
)


@transforms.add
def add_command(config, jobs):
    with open(SCRIPT_PATH) as f:
        script = f.read()

    for job in jobs:
        job["run"]["command"] = (
            "cat > /tmp/io-benchmark.py << 'BENCH_EOF'\n"
            f"{script}"
            "BENCH_EOF\n"
            "python3 -u /tmp/io-benchmark.py"
        )
        yield job
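The benchmark tasks were run via try push (per Phase 4). Assuming the kind is wired into the taskgraph as above, something like this would select all three tasks (the query string is illustrative):

```bash
./mach try fuzzy --full -q "io-benchmark"
```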