
Update-Verify I/O Performance Investigation

Date: 2026-03-20 to 2026-03-24

TL;DR

Update-verify tasks run ~6 minutes slower on AMD workers (b-linux-docker-amd) because those GCP machines (c3d-standard-16-lssd) only get 1 NVMe SSD, while Intel workers (c2-standard-16) get 2 NVMe SSDs in an LVM stripe, giving ~2x I/O throughput for the large-file copy pattern.

It's a provisioning asymmetry, not a CPU/Docker/image issue. AMD machines with 2 NVMe drives (c3d-standard-30-lssd) match Intel I/O performance exactly.


Context

Update-verify tasks on b-linux-docker-amd (c3d-standard-16-lssd) take longer than on b-linux (c2-standard-16). Source data points here.

Update-verify tasks run a cached_download.sh (link) loop that does grep + cp on cached files up to 120 MB in size, repeated up to 10,000 times per chunk.
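The hot path boils down to an index lookup plus a large copy. A minimal sketch (simplified; the real script linked above also handles cache misses and bookkeeping, and the variable and path names here are illustrative):

```bash
#!/bin/bash
# Sketch of the cached_download.sh retrieve path.
url="$1"       # URL being "downloaded"
output="$2"    # destination in the task's working directory

# 1. Look the URL up in the plain-text cache index; the matching line
#    number identifies the cached object.
line=$(grep -Fnx "$url" cache/urls.list | cut -d: -f1)

# 2. Copy the cached object (often tens of MB) into the working dir.
cp "cache/obj_$(printf '%05d' "$line").cache" "$output"
```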

On AMD workers (b-linux-docker-amd, c3d-standard-16-lssd), this workload runs about 6 minutes slower than on Intel workers (b-linux, c2-standard-16): total duration goes from ~39 to ~45 minutes.


Investigation Timeline

Phase 1: GCP VM Benchmarks (bare VMs)

Hypothesis: The c3d-standard-16-lssd hardware (CPU/NVMe) is inherently slower than c2-standard-16 for this I/O pattern.

We spun up matching GCP VMs and ran fio + application-level benchmarks (grep + cp on 50 files, 1000 iterations) directly on the VMs.
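The exact fio invocations aren't reproduced here; a representative pair, assuming 4k random I/O against a file on the mounted NVMe volume (paths illustrative):

```bash
# Random-read IOPS against the local NVMe mount.
fio --name=randread --directory=/mnt/nvme --size=4G --rw=randread \
    --bs=4k --ioengine=libaio --iodepth=64 --direct=1 \
    --runtime=60 --time_based --group_reporting

# Same shape for random writes.
fio --name=randwrite --directory=/mnt/nvme --size=4G --rw=randwrite \
    --bs=4k --ioengine=libaio --iodepth=64 --direct=1 \
    --runtime=60 --time_based --group_reporting
```

The near-identical numbers in the table below reflect both machines hitting the same IOPS caps.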

Result: Both machines performed identically.

| Test | c2 (Intel) | c3d (AMD) | Ratio |
| --- | --- | --- | --- |
| App-level avg (PD-SSD) | 176 ms | 178 ms | 0.99x |
| App-level avg (NVMe) | 149 ms | 149 ms | 1.00x |
| fio rand read IOPS (NVMe) | 180,073 | 180,008 | 1.00x |
| fio rand write IOPS (NVMe) | 100,029 | 100,022 | 1.00x |

Verdict: Hardware ruled out. Both machine types have identical NVMe hardware (same vendor ID 0x1ae0, same IOPS caps).

Phase 2: Docker + Firefox CI Image

Hypothesis: Docker overlay2 or the Firefox CI image introduces overhead that affects c3d differently.

We loaded the actual update-verify:latest Docker image from Taskcluster artifacts and tested five configurations: bare VM on PD-SSD, bare VM on NVMe, Docker overlay2, Docker with a bind-mounted PD-SSD, and Docker with a bind-mounted NVMe.
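Roughly how the three Docker configurations were exercised (a sketch: the image tag matches the task image, but the benchmark path and mount points are illustrative):

```bash
# Default overlay2 storage driver.
docker run --rm update-verify:latest python3 /tmp/bench.py

# Bind-mount a PD-SSD-backed directory over the workspace.
docker run --rm -v /mnt/pd-ssd/work:/builds/worker/workspace \
    update-verify:latest python3 /tmp/bench.py

# Bind-mount the local NVMe volume over the workspace.
docker run --rm -v /mnt/nvme/work:/builds/worker/workspace \
    update-verify:latest python3 /tmp/bench.py
```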

Result: all five configurations were within 3% of each other. Docker adds <2 ms of overhead, the same on both machines.

| Configuration | c2 (Intel) | c3d (AMD) | Ratio |
| --- | --- | --- | --- |
| Bare VM PD-SSD | 171 ms | 171 ms | 1.00x |
| Bare VM NVMe | 145 ms | 150 ms | 0.97x |
| Docker overlay2 | 169 ms | 171 ms | 0.99x |
| Docker + bind PD-SSD | 170 ms | 173 ms | 0.98x |
| Docker + bind NVMe | 150 ms | 147 ms | 1.02x |

Verdict: Docker/overlay2 and CI image ruled out.

Phase 3: PD Disk Type Investigation

Hypothesis: The Taskcluster workers use different Persistent Disk types (pd-standard vs pd-balanced) which could explain the gap.

We checked with gcloud compute disks list and found AMD workers use pd-balanced while Intel workers use pd-standard. We re-ran VM benchmarks with matching disk types.
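A representative query (the filter expression is illustrative):

```bash
gcloud compute disks list \
    --format="table(name,type,sizeGb,zone)" \
    --filter="name ~ 'b-linux'"
```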

Result: pd-balanced is actually faster than pd-standard (3x more write IOPS), which would make AMD faster, not slower. In any case, the boot disk type turned out to be entirely irrelevant:

Verdict: Boot disk type is irrelevant -- task workloads don't use the boot disk.

Phase 4: Inspecting Live Worker Hardware

Hypothesis: Something in Taskcluster's worker provisioning is different.

We added disk/storage diagnostics (lsblk, findmnt, mount, cgroup checks) to the benchmark script and ran it on actual Taskcluster workers via try push.
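Representative probes (the full set is embedded in the benchmark script in the appendix; the lvs/dmsetup lines are an assumed way to read the stripe layout and may need root):

```bash
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT    # drive count and layout
findmnt /builds/worker/workspace             # what backs the workspace
lvs -o lv_name,lv_size,stripes,stripesize    # LVM stripe count and size
dmsetup table                                # "striped 2 ..." = 2-way stripe
cat /sys/fs/cgroup/io.max /sys/fs/cgroup/memory.max 2>/dev/null  # cgroup v2 limits
```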

This revealed the root cause.

Storage topology on live workers:

| | Intel (b-linux) | AMD-16 (b-linux-docker-amd) |
| --- | --- | --- |
| Local NVMe drives | 2x 375G (nvme0n1, nvme0n2) | 1x 375G (nvme1n1) |
| LVM volume | 738G, stripe=32 | 369G, no stripe |
| Filesystem | ext4, nobarrier, data=writeback | ext4, nobarrier, data=writeback |
| cgroup I/O limits | None | None |
| cgroup memory limits | None | None |

Intel workers have 2 NVMe SSDs in an LVM stripe giving ~2x sequential throughput. AMD workers have 1 NVMe SSD with no striping.
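For reference, this is roughly how a 2-way stripe like the Intel workers' would be assembled from two local SSDs (illustrative, not the actual provisioning code: device paths and the volume group name are guesses, and "-I 32" assumes the reported stripe=32 is a 32 KiB LVM stripe size):

```bash
pvcreate /dev/nvme0n1 /dev/nvme0n2
vgcreate instance_storage /dev/nvme0n1 /dev/nvme0n2
# Two stripes (-i 2) interleave extents across both drives.
lvcreate --type striped -i 2 -I 32 -l 100%FREE -n data instance_storage
mkfs.ext4 /dev/instance_storage/data
# Mount options mirror those reported on the live workers.
mount -o noatime,nobarrier,data=writeback /dev/instance_storage/data \
    /builds/worker
```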

First Taskcluster benchmark results (tasks Qa_pc1spSEuJzg9q-92KBg and QgQJlSSUR_SkV7NE7qVaAg):

| Metric | Intel (2x NVMe) | AMD-16 (1x NVMe) | Ratio |
| --- | --- | --- | --- |
| Avg | 76 ms | 146 ms | 1.92x |

The 1.92x ratio matches the 2:1 NVMe drive count almost perfectly.

Phase 5: Confirmation with c3d-standard-30-lssd (2 NVMe)

Hypothesis: If the root cause is 1 vs 2 NVMe drives, then an AMD machine with 2 NVMe drives should match Intel.

The c3d-standard-16-lssd only supports 1 local SSD (fixed by GCP, cannot be changed). The c3d-standard-30-lssd (used by b-linux-docker-large-amd) gets 2 local SSDs. We added an amd-30 benchmark task and ran all three.

Tasks: O0wOvbnEQF6h8ptHBA6jbg (Intel), Yx4AjHXCRfyptBUIPTPtEg (AMD-30), WTIaalknRCmmcyO78GGIAg (AMD-16).

Result:

| | Intel (b-linux) | AMD-30 (b-linux-docker-large-amd) | AMD-16 (b-linux-docker-amd) |
| --- | --- | --- | --- |
| Machine | c2-standard-16 | c3d-standard-30-lssd | c3d-standard-16-lssd |
| NVMe drives | 2x 375G | 2x 375G | 1x 375G |
| LVM stripe | stripe=32 | stripe=32 | no stripe |
| Avg | 76 ms | 77 ms | 146 ms |
| P50 | 80 ms | 82 ms | 157 ms |
| P95 | 109 ms | 115 ms | 205 ms |
| P99 | 118 ms | 149 ms | 237 ms |
| Total (2000 ops) | 151.9 s | 154.8 s | 291.4 s |
| Ratio vs Intel | 1.00x | 1.01x | 1.92x |

AMD-30 with 2 NVMe drives matches Intel (1.01x); AMD-16 with 1 NVMe remains at 1.92x.


Hypotheses Tested and Eliminated

| # | Hypothesis | How Tested | Result |
| --- | --- | --- | --- |
| 1 | AMD CPU/NVMe hardware is slower | fio + app benchmarks on matching VMs | Identical performance |
| 2 | Docker overlay2 adds overhead on c3d | 5 storage configs with Firefox CI image | <3% difference |
| 3 | Firefox CI Docker image is the issue | Used exact update-verify:latest | No difference |
| 4 | PD boot disk type (pd-standard vs pd-balanced) | Tested both, checked actual disk types | Irrelevant -- tasks use local NVMe |
| 5 | cgroup I/O throttling | Checked io.max, blkio.throttle.* on live workers | No limits set |
| 6 | cgroup memory limits | Checked memory.max on live workers | No limits set |
| 7 | Docker daemon configuration differences | Inspected mount options on live workers | Identical ext4 options |
| 8 | fork/exec overhead | Measured subprocess spawn latency | 30 µs difference (negligible) |

Root Cause

The c3d-standard-16-lssd GCP machine type includes only 1 local NVMe SSD (375 GB). The c2-standard-16 is provisioned with 2 local NVMe SSDs (2x 375 GB) configured in an LVM stripe (stripe=32), giving it ~2x sequential I/O throughput.

The update-verify workload is dominated by sequential large-file copies (60-80 MB), which is exactly the pattern that benefits from LVM striping across multiple drives.
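A quick way to see the striping win on this pattern (illustrative; 1M sequential reads approximate what cp generates):

```bash
fio --name=seqread --directory=/builds/worker/workspace --size=4G \
    --rw=read --bs=1M --ioengine=libaio --iodepth=8 --direct=1 \
    --runtime=30 --time_based --group_reporting
```

On the striped 2-drive volume this should report roughly twice the bandwidth of the single-drive volume.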

This is not a bug in Taskcluster, Docker, or the CI image. It is a provisioning asymmetry: Intel workers get 2 local SSDs while AMD-16 workers get 1, because the -lssd machine type variants have a fixed number of local SSDs determined by vCPU count:

  • c3d-standard-16-lssd: 1x 375G local SSD
  • c3d-standard-30-lssd: 2x 375G local SSD
  • c3d-standard-44-lssd: 4x 375G local SSD

https://docs.cloud.google.com/compute/docs/general-purpose-machines#c3d_machine_types

The c2 family allows attaching local SSDs independently of vCPU count, so c2-standard-16 can have 2 (or more) local SSDs attached.
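For example, a c2 instance can be created with any number of local SSDs by repeating the flag (instance name and zone are illustrative), whereas the c3d -lssd variants come with the fixed counts listed above:

```bash
gcloud compute instances create io-test-c2 \
    --zone=us-central1-a \
    --machine-type=c2-standard-16 \
    --local-ssd=interface=NVME \
    --local-ssd=interface=NVME
```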


Appendix: Test Infrastructure

Benchmark Script

Note: this is the final form of the script used to benchmark I/O in tasks.

#!/usr/bin/env python3

"""
I/O benchmark that simulates the cache-retrieve pattern from update-verify.

The update-verify task caches downloaded MARs and tarballs on disk, then
retrieves them via cached_download.sh which does:
  1. grep -Fnx "$url" urls.list   (lookup URL in text index)
  2. cp obj_XXXXX.cache output    (copy 60-80MB file to working dir)

This is repeated ~9,784 times per chunk. On c3d-standard-16-lssd workers
(b-linux-docker-amd), this cp through Docker overlay2 shows 1.69x worse
tail latency than on c2-standard-16 (b-linux), adding ~14 minutes.

This benchmark reproduces that exact pattern to measure per-worker I/O.
"""

import os
import random
import shutil
import statistics
import subprocess
import sys
import time

BASE_DIR = "/builds/worker/workspace"
CACHE_DIR = os.path.join(BASE_DIR, "uv-cache")
WORK_DIR = os.path.join(BASE_DIR, "uv-work")
NUM_CACHE_FILES = 100
# 40 x 80MB (MARs) + 40 x 60MB (tarballs) + 20 x 5MB (small artifacts)
SIZES_MB = [80] * 40 + [60] * 40 + [5] * 20
NUM_RETRIEVES = 2000
MB = 1024 * 1024


def print_system_info():
    print("=== System info ===")
    os.system("uname -a")
    os.system('grep "model name" /proc/cpuinfo | head -1')
    os.system("free -h | head -2")
    os.system(f"df -Th {BASE_DIR}")
    os.system(f"findmnt {BASE_DIR} 2>/dev/null || true")
    os.system("lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,ROTA,MODEL 2>/dev/null || true")
    os.system("cat /sys/block/*/queue/scheduler 2>/dev/null || true")
    os.system("mount | grep -E 'workspace|builds|overlay' || true")
    print()

    print("=== cgroup I/O limits ===")
    os.system("cat /sys/fs/cgroup/io.max 2>/dev/null || true")
    os.system("cat /sys/fs/cgroup/io.latency 2>/dev/null || true")
    os.system(
        "cat /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device 2>/dev/null || true"
    )
    os.system(
        "cat /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device 2>/dev/null || true"
    )
    print()

    print("=== cgroup memory limits ===")
    os.system("cat /sys/fs/cgroup/memory.max 2>/dev/null || true")
    os.system("cat /sys/fs/cgroup/memory.current 2>/dev/null || true")
    print()


def populate_cache():
    print("=== Phase 1: Populate cache (simulates async_download.py) ===")

    os.makedirs(CACHE_DIR, exist_ok=True)
    os.makedirs(WORK_DIR, exist_ok=True)

    urls = []
    for i in range(NUM_CACHE_FILES):
        sz = SIZES_MB[i] * MB
        fname = os.path.join(CACHE_DIR, f"obj_{i+1:05d}.cache")
        url = (
            f"https://archive.mozilla.org/pub/firefox/fake/"
            f"{i+1}/firefox-{SIZES_MB[i]}MB.mar"
        )
        urls.append(url)
        with open(fname, "wb") as f:
            written = 0
            # Reuse one random chunk to avoid spending all time in urandom
            chunk = os.urandom(min(MB, sz))
            while written < sz:
                f.write(chunk)
                written += len(chunk)
        sys.stdout.write(f"\r  Created {i+1}/{NUM_CACHE_FILES} ({SIZES_MB[i]}MB)")
        sys.stdout.flush()
    print()

    urls_path = os.path.join(CACHE_DIR, "urls.list")
    with open(urls_path, "w") as f:
        f.writelines(u + "\n" for u in urls)
    print(f"  urls.list: {len(urls)} entries")

    total_gb = sum(SIZES_MB) / 1024
    print(f"  Total cache: {total_gb:.1f} GB")

    os.system("sync")
    os.system("echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true")
    print()
    return urls


def run_benchmark(urls):
    urls_path = os.path.join(CACHE_DIR, "urls.list")

    print("=== Phase 2: Cache retrieves (simulates cached_download.sh) ===")
    print(f"  {NUM_RETRIEVES} retrieves: grep urls.list + cp obj_XXXXX.cache")
    print()

    timings = []
    size_timings = {80: [], 60: [], 5: []}

    for r in range(NUM_RETRIEVES):
        idx = random.randint(0, NUM_CACHE_FILES - 1)
        url = urls[idx]
        sz = SIZES_MB[idx]
        cache_file = os.path.join(CACHE_DIR, f"obj_{idx+1:05d}.cache")
        out_file = os.path.join(WORK_DIR, "output.dat")

        t0 = time.monotonic()
        subprocess.run(["grep", "-Fnx", url, urls_path], check=False, capture_output=True)
        shutil.copy2(cache_file, out_file)
        t1 = time.monotonic()

        elapsed_ms = (t1 - t0) * 1000
        timings.append(elapsed_ms)
        size_timings[sz].append(elapsed_ms)

        if (r + 1) % 200 == 0:
            avg = statistics.mean(timings)
            print(f"  [{r+1:4d}/{NUM_RETRIEVES}] last={elapsed_ms:.0f}ms avg={avg:.0f}ms")

    return timings, size_timings


def report(label, data):
    if not data:
        return
    data.sort()
    n = len(data)
    total = sum(data)
    avg = total / n
    p50 = data[n * 50 // 100]
    p95 = data[n * 95 // 100]
    p99 = data[n * 99 // 100]

    print()
    print(f"--- {label} ---")
    print(f"  Count:  {n}")
    print(f"  Total:  {total:.0f}ms ({total/1000:.1f}s)")
    print(f"  Avg:    {avg:.0f}ms")
    print(f"  Min:    {data[0]:.0f}ms")
    print(f"  Max:    {data[-1]:.0f}ms")
    print(f"  P50:    {p50:.0f}ms")
    print(f"  P95:    {p95:.0f}ms")
    print(f"  P99:    {p99:.0f}ms")

    print()
    print("  Histogram:")
    buckets = {}
    for v in data:
        b = int(v // 100) * 100
        b = min(b, 2000)
        buckets[b] = buckets.get(b, 0) + 1
    for b in sorted(buckets):
        bar = "#" * min(buckets[b], 60)
        hi = f"{b+99}" if b < 2000 else "+"
        print(f"    {b:5d}-{hi:>5s}ms: {bar} ({buckets[b]})")


def main():
    print_system_info()
    urls = populate_cache()
    timings, size_timings = run_benchmark(urls)

    print()
    print("=========================================")
    print("=== RESULTS ===")
    print("=========================================")

    report("All retrieves", timings)
    for sz in [80, 60, 5]:
        report(f"{sz}MB files", size_timings[sz])

    print()
    print("=== Benchmark complete ===")


if __name__ == "__main__":
    main()

Kind:

---
loader: taskgraph.loader.transform:loader

transforms:
    - gecko_taskgraph.transforms.io_benchmark:transforms
    - gecko_taskgraph.transforms.job:transforms
    - gecko_taskgraph.transforms.task:transforms

task-defaults:
    worker:
        docker-image:
            in-tree: "update-verify"
        max-run-time: 3600
    run:
        using: run-task
        checkout: false
        command:  # To be filled by transform
    treeherder:
        kind: test
        platform: linux64/opt
        tier: 3

tasks:
    amd-16:
        description: I/O cache-retrieve benchmark on b-linux-docker-amd
        worker-type: b-linux-docker-amd
        treeherder:
            symbol: IOB(amd16)

    amd-30:
        description: I/O cache-retrieve benchmark on b-linux-docker-large-amd
        worker-type: b-linux-docker-large-amd
        treeherder:
            symbol: IOB(amd30)

    intel:
        description: I/O cache-retrieve benchmark on b-linux
        worker-type: b-linux
        treeherder:
            symbol: IOB(intel)

Transform:

import os

from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()

SCRIPT_PATH = os.path.join(
    os.path.dirname(__file__), "..", "..", "scripts", "misc", "io-benchmark.py"
)


@transforms.add
def add_command(config, jobs):
    with open(SCRIPT_PATH) as f:
        script = f.read()

    for job in jobs:
        job["run"]["command"] = (
            "cat > /tmp/io-benchmark.py << 'BENCH_EOF'\n"
            f"{script}"
            "BENCH_EOF\n"
            "python3 -u /tmp/io-benchmark.py"
        )
        yield job
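The benchmark tasks were run via try push (per Phase 4). Assuming the kind is wired into the taskgraph as above, something like this would select all three tasks (the query string is illustrative):

```bash
./mach try fuzzy --full -q "io-benchmark"
```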