@bryantbiggs
Created February 20, 2026 14:46
Minimal reproduction of OpenTelemetry Rust OTLP exporter deadlocks (issues #2802, #2715, #2539, #2071)

OpenTelemetry Rust OTLP Exporter Deadlock Reproduction

Minimal reproductions of deadlock issues when using OTLP exporters (tonic/gRPC) with the OpenTelemetry Rust SDK on constrained tokio runtimes.

Related issues:

- https://github.com/open-telemetry/opentelemetry-rust/issues/2802
- https://github.com/open-telemetry/opentelemetry-rust/issues/2715
- https://github.com/open-telemetry/opentelemetry-rust/issues/2539
- https://github.com/open-telemetry/opentelemetry-rust/issues/2071

Tested with opentelemetry_sdk v0.31.0 and opentelemetry-otlp v0.31.0.

Root Cause

The default thread-based processors (PeriodicReader, BatchSpanProcessor, BatchLogProcessor) use dedicated OS threads that call futures_executor::block_on(exporter.export(...)).

When the exporter is tonic/gRPC, the export future needs the tokio reactor to drive HTTP/2 IO. The reactor is driven by tokio's worker threads. If all worker threads are blocked (by force_flush()/shutdown() calls), the reactor stalls and the export can never complete → deadlock.

No experimental features or special configuration is needed to hit this. The default code path with the published crates is affected.

Examples

1. PeriodicReader + current_thread runtime (TRUE DEADLOCK)

cargo run --example periodic_reader_current_thread

Uses current_thread runtime (same as #[tokio::test] default). Calls force_flush() which blocks the only tokio thread. The worker thread's tonic export can't complete because the reactor is stalled. Hangs forever.

2. PeriodicReader + multi_thread(1 worker) from tokio::spawn (TRUE DEADLOCK)

cargo run --example periodic_reader_multi_thread_1_worker

Simulates a 1-vCPU Kubernetes pod. Calls force_flush() from inside tokio::spawn, blocking the only worker thread. The entire runtime freezes — even tokio::time::sleep can't fire. Hangs forever.

Note: rt-tokio-current-thread does NOT help here — the runtime is multi_thread flavor.

3. BatchSpanProcessor + current_thread runtime (TIMEOUT)

cargo run --example batch_span_processor_current_thread

Same root cause as #1 but for spans. The BatchSpanProcessor has an internal 5-second timeout on force_flush, so it returns Err(Timeout(5s)) instead of hanging forever. However, the worker thread is permanently stuck — it will never recover.

4. Control: multi_thread with default workers (WORKS)

cargo run --example working_multi_thread

With multiple worker threads, other threads can drive the reactor while one is blocked. Completes immediately (with a connection error since no collector is running, but no hang).

examples/batch_span_processor_current_thread.rs

//! Demonstrates the stalled-reactor failure: BatchSpanProcessor + tonic exporter on a current_thread runtime.
//!
//! The BatchSpanProcessor's worker thread calls `futures_executor::block_on(exporter.export(...))`
//! without tokio runtime context. When using tonic/gRPC, the export future needs the tokio
//! reactor to drive HTTP/2 IO. On a current_thread runtime, force_flush()/shutdown() blocks the
//! only tokio thread, preventing the reactor from driving the worker's export. The processor's
//! internal 5-second force_flush timeout turns this into an error instead of an indefinite hang,
//! but the worker thread never recovers.
//!
//! Reproduces: https://github.com/open-telemetry/opentelemetry-rust/issues/2715
//! https://github.com/open-telemetry/opentelemetry-rust/issues/2071
//!
//! Run: cargo run --example batch_span_processor_current_thread
//! Expected: force_flush blocks for ~5s, then returns Err(Timeout); the worker thread stays stuck

use opentelemetry::global;
use opentelemetry::trace::{Span, Tracer};
use opentelemetry_otlp::SpanExporter;
use opentelemetry_sdk::trace::SdkTracerProvider;
use opentelemetry_sdk::Resource;
use std::time::Duration;

fn main() {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        println!("[batch_span/current_thread] Building tonic SpanExporter...");
        let exporter = SpanExporter::builder()
            .with_tonic()
            .build()
            .expect("failed to build exporter");

        println!("[batch_span/current_thread] Building TracerProvider with batch exporter...");
        let provider = SdkTracerProvider::builder()
            .with_batch_exporter(exporter)
            .with_resource(
                Resource::builder()
                    .with_service_name("deadlock-repro")
                    .build(),
            )
            .build();
        global::set_tracer_provider(provider.clone());

        // Create some spans so the batch processor has work to do
        let tracer = global::tracer("test");
        for i in 0..10 {
            let mut span = tracer.start(format!("test-span-{}", i));
            span.end();
        }

        // Small delay to let spans queue up
        tokio::time::sleep(Duration::from_millis(100)).await;

        println!("[batch_span/current_thread] Calling provider.force_flush()...");
        println!("[batch_span/current_thread] *** This will block for ~5s, then time out ***");
        let start = std::time::Instant::now();
        let result = provider.force_flush();
        let elapsed = start.elapsed();
        println!(
            "[batch_span/current_thread] force_flush returned after {:.1}s: {:?}",
            elapsed.as_secs_f64(),
            result
        );
    });
}
Cargo.toml

[package]
name = "otel-deadlock-repro"
version = "0.1.0"
edition = "2021"

[dependencies]
opentelemetry = "0.31.0"
opentelemetry_sdk = { version = "0.31.0", features = ["rt-tokio", "rt-tokio-current-thread"] }
opentelemetry-otlp = { version = "0.31.0", features = ["grpc-tonic"] }
tokio = { version = "1", features = ["full"] }

[[example]]
name = "periodic_reader_current_thread"
path = "examples/periodic_reader_current_thread.rs"

[[example]]
name = "periodic_reader_multi_thread_1_worker"
path = "examples/periodic_reader_multi_thread_1_worker.rs"

[[example]]
name = "batch_span_processor_current_thread"
path = "examples/batch_span_processor_current_thread.rs"

[[example]]
name = "working_multi_thread"
path = "examples/working_multi_thread.rs"

// This crate contains examples demonstrating OTLP exporter deadlocks.
// See the examples/ directory for individual reproduction cases.
examples/periodic_reader_current_thread.rs

//! Demonstrates deadlock: PeriodicReader + tonic exporter on current_thread tokio runtime.
//!
//! The default PeriodicReader uses a dedicated OS thread that calls
//! `futures_executor::block_on(exporter.export(...))`. When the exporter is tonic/gRPC,
//! it needs the tokio reactor to drive IO. On a current_thread runtime, the reactor is
//! only driven by the single tokio thread. When force_flush()/shutdown() blocks that
//! thread waiting for the worker's response, the worker's export can't complete because
//! the reactor is stalled → deadlock.
//!
//! Reproduces: https://github.com/open-telemetry/opentelemetry-rust/issues/2802
//!
//! Run: cargo run --example periodic_reader_current_thread
//! Expected: Program hangs indefinitely on force_flush (deadlock)

use opentelemetry::metrics::MeterProvider;
use opentelemetry_otlp::MetricExporter;
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider};
use std::time::Duration;

fn main() {
    // current_thread runtime — single threaded, same as #[tokio::test] default
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        println!("[current_thread] Building tonic MetricExporter...");
        let exporter = MetricExporter::builder()
            .with_tonic()
            .build()
            .expect("failed to build exporter");

        println!("[current_thread] Building PeriodicReader (default, thread-based)...");
        let reader = PeriodicReader::builder(exporter)
            .with_interval(Duration::from_secs(120))
            .build();
        let provider = SdkMeterProvider::builder().with_reader(reader).build();

        // Create actual metrics so there's data to export
        let meter = provider.meter("deadlock-repro");
        let counter = meter.u64_counter("test.counter").build();
        counter.add(1, &[]);
        counter.add(2, &[]);
        counter.add(3, &[]);

        println!("[current_thread] Recorded metrics. Calling force_flush()...");
        println!("[current_thread] *** This will hang/deadlock ***");
        println!("[current_thread] (force_flush blocks the tokio thread, preventing the reactor");
        println!("[current_thread] from driving tonic's IO on the worker thread)");
        let start = std::time::Instant::now();
        // force_flush sends a message to the worker thread and blocks waiting for a response.
        // The worker thread tries to export via tonic, which needs the tokio reactor.
        // But the reactor is on THIS thread, which is now blocked → deadlock.
        let result = provider.force_flush();
        let elapsed = start.elapsed();
        println!(
            "[current_thread] force_flush returned after {:.1}s: {:?}",
            elapsed.as_secs_f64(),
            result
        );
    });
}
examples/periodic_reader_multi_thread_1_worker.rs

//! Demonstrates deadlock: PeriodicReader + tonic exporter on multi_thread(1) from spawned task.
//!
//! When force_flush()/shutdown() is called from inside a tokio::spawn task on a
//! multi_thread runtime with only 1 worker thread (e.g., 1-vCPU k8s pod), the blocking
//! call occupies the only worker thread, preventing the reactor from driving tonic's IO
//! on the PeriodicReader's worker thread → deadlock.
//!
//! `rt-tokio-current-thread` does NOT help here — the runtime is multi_thread flavor.
//!
//! Reproduces: https://github.com/open-telemetry/opentelemetry-rust/issues/2802#issuecomment-3619094240
//!
//! Run: cargo run --example periodic_reader_multi_thread_1_worker
//! Expected: Program hangs indefinitely (deadlock)

use opentelemetry::metrics::MeterProvider;
use opentelemetry_otlp::MetricExporter;
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider};
use std::time::Duration;

fn main() {
    // multi_thread runtime with only 1 worker — simulates 1-vCPU k8s pod
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1)
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        println!("[multi_thread(1)] Building tonic MetricExporter...");
        let exporter = MetricExporter::builder()
            .with_tonic()
            .build()
            .expect("failed to build exporter");

        println!("[multi_thread(1)] Building PeriodicReader (default, thread-based)...");
        let reader = PeriodicReader::builder(exporter)
            .with_interval(Duration::from_secs(120))
            .build();
        let provider = SdkMeterProvider::builder().with_reader(reader).build();

        // Create actual metrics so there's data to export
        let meter = provider.meter("deadlock-repro");
        let counter = meter.u64_counter("test.counter").build();
        counter.add(1, &[]);

        println!("[multi_thread(1)] Calling force_flush() from inside tokio::spawn...");
        println!("[multi_thread(1)] *** This will hang/deadlock ***");
        println!("[multi_thread(1)] (The spawned task blocks the only worker thread,");
        println!("[multi_thread(1)] preventing the reactor from driving tonic's IO)");

        // Spawn a task that calls force_flush — this blocks the worker thread
        let handle = tokio::spawn(async move {
            let start = std::time::Instant::now();
            let result = provider.force_flush();
            let elapsed = start.elapsed();
            println!(
                "[multi_thread(1)] force_flush returned after {:.1}s: {:?}",
                elapsed.as_secs_f64(),
                result
            );
        });

        // Wait with a timeout to demonstrate the hang
        tokio::select! {
            _ = handle => {
                println!("[multi_thread(1)] Task completed (unexpected)");
            }
            _ = tokio::time::sleep(Duration::from_secs(10)) => {
                println!("[multi_thread(1)] DEADLOCK: Task did not complete after 10s");
            }
        }
    });
}
examples/working_multi_thread.rs

//! Control case: PeriodicReader + tonic exporter on multi_thread runtime with default workers.
//!
//! With multiple worker threads, the tokio reactor can be driven by other threads while
//! one thread is blocked in force_flush()/shutdown(). This allows the worker thread's
//! tonic export to complete.
//!
//! Run: cargo run --example working_multi_thread
//! Expected: force_flush and shutdown complete without hanging

use opentelemetry::metrics::MeterProvider;
use opentelemetry_otlp::MetricExporter;
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider};
use std::time::Duration;

fn main() {
    // multi_thread runtime with default workers (typically num_cpus, minimum 2)
    let rt = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        println!("[multi_thread(default)] Building tonic MetricExporter...");
        let exporter = MetricExporter::builder()
            .with_tonic()
            .build()
            .expect("failed to build exporter");

        println!("[multi_thread(default)] Building PeriodicReader (default, thread-based)...");
        let reader = PeriodicReader::builder(exporter)
            .with_interval(Duration::from_secs(120))
            .build();
        let provider = SdkMeterProvider::builder().with_reader(reader).build();

        // Create actual metrics so there's data to export
        let meter = provider.meter("deadlock-repro");
        let counter = meter.u64_counter("test.counter").build();
        counter.add(1, &[]);
        counter.add(2, &[]);
        counter.add(3, &[]);

        println!("[multi_thread(default)] Recorded metrics. Calling force_flush()...");
        let start = std::time::Instant::now();
        let result = provider.force_flush();
        let elapsed = start.elapsed();
        println!(
            "[multi_thread(default)] force_flush returned after {:.1}s: {:?}",
            elapsed.as_secs_f64(),
            result
        );

        println!("[multi_thread(default)] Calling shutdown()...");
        let start = std::time::Instant::now();
        let result = provider.shutdown();
        let elapsed = start.elapsed();
        println!(
            "[multi_thread(default)] shutdown returned after {:.1}s: {:?}",
            elapsed.as_secs_f64(),
            result
        );

        println!("[multi_thread(default)] Success! No deadlock with multiple worker threads.");
    });
}