The DistSender circuit breaker prevents CockroachDB's DistSender from getting
stuck on non-functional replicas. The DistSender normally relies on receiving a
NotLeaseHolderError (NLHE) from a replica to redirect to other replicas. If a
replica is stuck or unreachable, it never returns an NLHE, so without a circuit
breaker the DistSender can keep retrying the same non-functional replica
indefinitely instead of routing around it.
Single-node CockroachDB (n2-standard-16, 64GB RAM) running a KV workload at
~20% CPU with GOGC=off and GOMEMLIMIT=51GiB. The live heap is ~480MB, but
with GOGC disabled, the heap grows to ~50GB before GC triggers (driven entirely
by the memory limit). GC runs roughly every 24 seconds.
Single-node CockroachDB (n2-standard-16), KV workload at ~20-25% CPU.
GODEBUG=gctrace=1,gcpacertrace=1.
pacer: assist ratio=+1.966144e+000 (scan 226 MB in 1660->1736 MB) workers=4++0.000000e+000
pacer: 27% CPU (25 exp.) for 151835216+1501680+2831530 B work (155682658 B exp.) in 1741434408 B -> 1766226960 B (∆goal -54389166, cons/mark +1.702709e-001)
gc 20311 @11135.890s 0%: 0.099+11+0.098 ms clock, 1.5+5.1/45/49+1.5 ms cpu, 1660->1684->434 MB, 1736 MB goal, 1 MB stacks, 2 MB globals, 16 P
pacer: sweep done at heap size 458MB; allocated 23MB during sweep; swept 218348 pages at +1.681737e-004 pages/byte
OTel Datadog exporter inflates counter metric rates by ~3x
Summary
The Datadog cockroachdb.sys.gc.assist.ns metric (and likely all Prometheus
counter-type metrics) reports a rate ~3x higher than the actual rate when using
.as_rate(). The root cause appears to be a mismatch between the OTel
Prometheus scrape interval (30s) and the interval metadata submitted to Datadog
by the OTel Datadog exporter (suspected 10s, matching the batch processor
timeout).
[correctness] highDiskSpaceUtilization comment is now stale (capacity_model.go:703-724): The comment explains that fractionUsed = load/capacity = LogicalBytes / (LogicalBytes / diskUtil) = diskUtil. Under the new model, load=Used, capacity=Used+Available — the math still recovers actual disk utilization, but the comment references the old LogicalBytes-based derivation and is now misleading.
[correctness] minCapacity floor is dramatically lower than the old floor (physical_model.go): The old model had cpuCapacityFloorPerStore = 0.1 * 1e9 (0.1 cores). The new minCapacity = 1.0 means 1 ns/s — effectively zero CPU capacity. The old floor existed to prevent utilization from going to infinity on overloaded nodes (its comment explains this in detail). If a store has non-zero load and capacity=1 ns/s, utilization becomes astronomically large — exactly the blowup the old floor was designed to prevent.
Review: PR #161454 — kvserver: thread in correct engine when destroying and subsuming replicas
Summary
This PR replaces two uses of kvstorage.TODOReadWriter(b.batch) in
replicaAppBatch.runPostAddTriggersReplicaOnly with a new
b.ReadWriter() helper that correctly separates the state engine batch
(b.batch) from the raft engine batch (b.RaftBatch()). This is part of the
broader effort to logically separate the state and raft engines in the apply
stack (issue #161059). The change is correct, small, and follows the pattern
established by earlier work in this effort.
Review: PR #79134 — kv: support FOR {UPDATE,SHARE} SKIP LOCKED
Summary
This PR implements the KV portion of SKIP LOCKED support for
SELECT ... FOR UPDATE SKIP LOCKED and SELECT ... FOR SHARE SKIP LOCKED.
The change spans the MVCC scanner, KV concurrency control, optimistic
evaluation, timestamp cache, refresh spans, and the lock table. The SQL
optimizer still rejects SKIP LOCKED (the SQL portion was extracted into
a separate PR, #83627), so this is plumbing-only from the KV side.
Review: PR #164677 — changefeedccl: add roachtest for CDC rolling restarts with KV workload
Summary
This PR adds a roachtest that exercises changefeeds during rolling node
drain+restart cycles and introduces a COCKROACH_CHANGEFEED_TESTING_SLOW_RETRY
env var for reaching max backoff behavior quickly. The test is well-structured
and the motivation is clear. There are a few structural and correctness issues
worth addressing.