llm-d Metrics Documentation

This document provides an overview of all metrics generated by the llm-d components.

Overview

The llm-d system uses Prometheus as the primary metrics collection framework, with metrics covering inference performance, resource utilization, error rates, and energy consumption across multiple components.
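
Because every metric below is exposed in the standard Prometheus text format, a component can be inspected directly from its /metrics endpoint before any scrape configuration exists. The following is a minimal sketch, assuming a hypothetical endpoint at http://localhost:8000/metrics and the requests and prometheus_client Python packages; it simply lists the metric families whose names match the prefixes documented here.

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Assumed endpoint; substitute the actual host/port of the component being inspected.
METRICS_URL = "http://localhost:8000/metrics"

text = requests.get(METRICS_URL, timeout=5).text

# Print every exposed metric family whose name uses one of the
# prefixes documented in this overview.
prefixes = ("inference_", "bbr_", "vllm:")
for family in text_string_to_metric_families(text):
    if family.name.startswith(prefixes):
        print(family.type, family.name, "-", family.documentation)
```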

Component Metrics

1. llm-d KV Cache Manager

Status: No Prometheus metrics currently implemented

The KV Cache Manager component does not currently expose Prometheus metrics directly. However, KV cache-related metrics are available through the Gateway API Inference Extension.

Metrics Location: All KV cache-related metrics are defined in gateway-api-inference-extension/pkg/epp/metrics/metrics.go

KV Cache Utilization Metrics

These metrics are exposed through the Gateway API Inference Extension but relate to KV cache functionality:

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_pool_average_kv_cache_utilization | Gauge | Average KV cache utilization per pool | name |
| inference_extension_prefix_indexer_size | Gauge | Size of the prefix indexer | - |
| inference_extension_prefix_indexer_hit_ratio | Histogram | Cache hit ratio distribution | - |
| inference_extension_prefix_indexer_hit_bytes | Histogram | Cache hit length distribution | - |
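
As a quick sanity check, the pool-level utilization gauge can be read from a Prometheus server with an instant query. This is a sketch only; the Prometheus address is an assumption.

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# Current average KV cache utilization per inference pool, highest first
# (the `name` label identifies the pool).
query = "sort_desc(inference_pool_average_kv_cache_utilization)"
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()

for series in result["data"]["result"]:
    print(series["metric"].get("name"), series["value"][1])
```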

2. llm-d Inference Scheduler

Metrics Location: All inference scheduler metrics are defined in gateway-api-inference-extension/pkg/epp/metrics/metrics.go

The Inference Scheduler provides scheduling and performance metrics:

Scheduler Performance Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_extension_scheduler_e2e_duration_seconds | Histogram | End-to-end scheduling latency | - |
| inference_extension_scheduler_plugin_duration_seconds | Histogram | Plugin processing latency | plugin_type, plugin_name |
| inference_extension_request_control_plugin_duration_seconds | Histogram | RequestControl plugin latency | plugin_type, plugin_name |
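
A typical use of these histograms is deriving latency quantiles from their buckets. The sketch below, assuming a Prometheus server at a hypothetical localhost:9090, computes the p95 end-to-end scheduling latency over the last five minutes.

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# p95 end-to-end scheduling latency over the last 5 minutes, computed
# from the buckets of the scheduler e2e duration histogram.
query = (
    "histogram_quantile(0.95, sum by (le) ("
    "rate(inference_extension_scheduler_e2e_duration_seconds_bucket[5m])))"
)
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()
print(result["data"]["result"])
```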

Scheduler Backend Metrics

The scheduler scrapes metrics from individual inference server pods:

| Metric Category | Description |
|-----------------|-------------|
| Queue Size Metrics | Number of requests waiting in queue |
| KV Cache Utilization | Percentage of KV cache being used |
| LoRA Adapter Metrics | Running and waiting LoRA adapters |
| Maximum Active Models | Capacity information |

3. llm-d Routing Sidecar

Status: No Prometheus metrics currently implemented

The routing sidecar component does not currently expose any Prometheus metrics. Adding a Prometheus client library would allow it to report metrics for routing operations such as the following (a sketch appears after this list):

  • Request routing latency
  • Routing success/failure rates
  • Target selection metrics
  • Connection pool utilization
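
As an illustration only, the sketch below shows what such instrumentation could look like. It uses the Python prometheus_client to keep the example compact, even though the sidecar itself is a Go component (client_golang would be the natural choice there); every metric name in it is hypothetical and none exists in the sidecar today.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names -- none of these exist in the sidecar today.
ROUTING_LATENCY = Histogram(
    "routing_sidecar_request_routing_duration_seconds",
    "Time spent selecting a target and forwarding the request",
)
ROUTING_RESULTS = Counter(
    "routing_sidecar_routing_results_total",
    "Routing outcomes by status",
    ["status"],  # e.g. success / failure
)

def route(request):
    # Placeholder for the sidecar's real target-selection and proxying logic.
    with ROUTING_LATENCY.time():
        time.sleep(random.uniform(0.001, 0.01))  # stand-in for forwarding work
        ok = random.random() > 0.05
    ROUTING_RESULTS.labels(status="success" if ok else "failure").inc()

if __name__ == "__main__":
    start_http_server(8001)  # exposes /metrics on :8001
    while True:
        route(None)
        time.sleep(0.1)
```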

4. Gateway API Inference Extension

Metrics Location: The Gateway API Inference Extension metrics are defined in three files:

  • Main metrics: gateway-api-inference-extension/pkg/epp/metrics/metrics.go (most metrics)
  • Per-pod queue metrics: gateway-api-inference-extension/pkg/epp/metrics/collectors/inference_pool.go
  • BBR metrics: gateway-api-inference-extension/pkg/bbr/metrics/metrics.go

Inference Model Metrics (Subsystem: "inference_model")

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_model_request_total | Counter | Total inference model requests | model_name, target_model_name |
| inference_model_request_error_total | Counter | Total inference model request errors | model_name, target_model_name, error_code |
| inference_model_request_duration_seconds | Histogram | Request latency distribution | model_name, target_model_name |
| inference_model_request_sizes | Histogram | Request size distribution (bytes) | model_name, target_model_name |
| inference_model_response_sizes | Histogram | Response size distribution (bytes) | model_name, target_model_name |
| inference_model_input_tokens | Histogram | Input token count distribution | model_name, target_model_name |
| inference_model_output_tokens | Histogram | Output token count distribution | model_name, target_model_name |
| inference_model_normalized_time_per_output_token_seconds | Histogram | Latency per output token | model_name, target_model_name |
| inference_model_running_requests | Gauge | Number of running requests | model_name |
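
Because the request and error counters carry the same model_name label, a per-model error ratio falls out of a simple PromQL division. A sketch, assuming a Prometheus server at a hypothetical localhost:9090:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# Per-model error ratio over the last 5 minutes: errors / total requests,
# grouped by the shared model_name label.
query = (
    "sum by (model_name) (rate(inference_model_request_error_total[5m])) "
    "/ sum by (model_name) (rate(inference_model_request_total[5m]))"
)
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()

for series in result["data"]["result"]:
    print(series["metric"]["model_name"], series["value"][1])
```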

Inference Pool Metrics (Subsystem: "inference_pool")

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_pool_average_kv_cache_utilization | Gauge | Average KV cache utilization | name |
| inference_pool_average_queue_size | Gauge | Average pending requests in queue | name |
| inference_pool_ready_pods | Gauge | Number of ready pods | name |
| inference_pool_per_pod_queue_size | Gauge | Queue size per pod | name, model_server_pod |

Scheduler Metrics (Subsystem: "inference_extension")

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_extension_scheduler_e2e_duration_seconds | Histogram | End-to-end scheduling latency | - |
| inference_extension_scheduler_plugin_duration_seconds | Histogram | Plugin processing latency | plugin_type, plugin_name |
| inference_extension_request_control_plugin_duration_seconds | Histogram | RequestControl plugin latency | plugin_type, plugin_name |

Prefix Cache Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_extension_prefix_indexer_size | Gauge | Prefix indexer size | - |
| inference_extension_prefix_indexer_hit_ratio | Histogram | Cache hit ratio | - |
| inference_extension_prefix_indexer_hit_bytes | Histogram | Cache hit bytes | - |

Info Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_extension_info | Gauge | Build information | commit, build_ref |

BBR Component Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| bbr_success_total | Counter | Successful model name injections | - |
| bbr_model_not_in_body_total | Counter | Model not present in request body | - |
| bbr_model_not_parsed_total | Counter | Model parsing failures | - |

5. vLLM Engine Metrics (v1)

Metrics Location: vLLM v1 metrics are defined in multiple files under the vllm/vllm/v1/ directory:

  • Main metrics: vllm/vllm/v1/metrics/loggers.py (most metrics)
  • Speculative decoding: vllm/vllm/v1/spec_decode/metrics.py

Note: llm-d uses vLLM v1, whose metric names differ from those of earlier vLLM releases. Key differences (illustrated in the query sketch after this list):

  • Counter metrics drop the _total suffix (e.g., vllm:prompt_tokens vs vllm:prompt_tokens_total)
  • Prefix cache metrics are Counters instead of Histograms
  • Enhanced modular architecture with dedicated metrics modules
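
The naming difference mainly affects PromQL: rates are taken over the suffix-less counters. A sketch of token-throughput queries, assuming a Prometheus server at a hypothetical localhost:9090 (adjust the names if your exposition format re-appends _total):

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# Prompt and generation token throughput (tokens/second) over 5 minutes,
# using the v1 counter names listed in the tables below.
queries = {
    "prompt_tokens_per_s": "sum(rate(vllm:prompt_tokens[5m]))",
    "generation_tokens_per_s": "sum(rate(vllm:generation_tokens[5m]))",
}
for label, query in queries.items():
    result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()
    print(label, result["data"]["result"])
```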

System State Gauges

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:num_requests_running | Gauge | Requests currently running on GPU | - |
| vllm:num_requests_waiting | Gauge | Requests waiting to be processed | - |
| vllm:gpu_cache_usage_perc | Gauge | GPU KV-cache usage percentage | - |
| vllm:kv_cache_usage_perc | Gauge | KV-cache usage percentage (V1) | - |
| vllm:lora_requests_info | Gauge | LoRA requests statistics | running_lora_adapters, max_lora, waiting_lora_adapters |

Iteration Counters

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:num_preemptions | Counter | Total preemptions from engine | - |
| vllm:prompt_tokens | Counter | Total prefill tokens processed | - |
| vllm:generation_tokens | Counter | Total generation tokens processed | - |

Iteration Histograms

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:iteration_tokens_total | Histogram | Tokens per engine step | - |
| vllm:time_to_first_token_seconds | Histogram | Time to first token | - |
| vllm:time_per_output_token_seconds | Histogram | Time per output token | - |
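
Time-to-first-token is usually reported as a quantile derived from these histogram buckets. A sketch, again assuming a Prometheus server at a hypothetical localhost:9090:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# p99 time-to-first-token over the last 5 minutes, from histogram buckets.
query = (
    "histogram_quantile(0.99, sum by (le) ("
    "rate(vllm:time_to_first_token_seconds_bucket[5m])))"
)
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()
print(result["data"]["result"])
```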

Request Latency Histograms

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:e2e_request_latency_seconds | Histogram | End-to-end request latency | - |
| vllm:request_queue_time_seconds | Histogram | Time in WAITING phase | - |
| vllm:request_inference_time_seconds | Histogram | Time in RUNNING phase | - |
| vllm:request_prefill_time_seconds | Histogram | Time in PREFILL phase | - |
| vllm:request_decode_time_seconds | Histogram | Time in DECODE phase | - |

Request Metadata Histograms

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:request_prompt_tokens | Histogram | Prefill tokens per request | - |
| vllm:request_generation_tokens | Histogram | Generation tokens per request | - |
| vllm:request_max_num_generation_tokens | Histogram | Maximum requested generation tokens | - |
| vllm:request_params_n | Histogram | Request parameter n | - |
| vllm:request_params_max_tokens | Histogram | Request parameter max_tokens | - |

Request Success Counter

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:request_success | Counter | Successfully processed requests | finished_reason |

Speculative Decoding Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:spec_decode_num_drafts | Counter | Number of drafts | - |
| vllm:spec_decode_num_draft_tokens | Counter | Draft tokens | - |
| vllm:spec_decode_num_accepted_tokens | Counter | Accepted tokens | - |
| vllm:spec_decode_num_accepted_tokens_per_pos | Counter | Accepted tokens per position | position |
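
A commonly derived quantity is the draft-token acceptance rate: accepted draft tokens divided by proposed draft tokens. A sketch, assuming a Prometheus server at a hypothetical localhost:9090:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# Speculative-decoding acceptance rate over the last 5 minutes.
query = (
    "sum(rate(vllm:spec_decode_num_accepted_tokens[5m])) "
    "/ sum(rate(vllm:spec_decode_num_draft_tokens[5m]))"
)
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()
print(result["data"]["result"])
```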

Prefix Cache Metrics (V1)

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:prefix_cache_queries | Counter | Prefix cache queries (tokens) | - |
| vllm:prefix_cache_hits | Counter | Prefix cache hits (tokens) | - |
| vllm:cache_config_info | Gauge | Cache configuration info | - |
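
Since both prefix-cache metrics are counters in v1, the hit ratio is computed from their rates rather than read from a histogram. A sketch, assuming a Prometheus server at a hypothetical localhost:9090:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# Token-level prefix cache hit ratio over the last 5 minutes:
# hit tokens divided by queried tokens.
query = (
    "sum(rate(vllm:prefix_cache_hits[5m])) "
    "/ sum(rate(vllm:prefix_cache_queries[5m]))"
)
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()
print(result["data"]["result"])
```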

Summary

The llm-d metrics system provides observability across the core components:

  • 40+ distinct metrics covering inference performance, resource utilization, and system health
  • Metric types: Counters, Gauges, Histograms, and Info metrics
  • Core LLM-D Components:
    • llm-d-kv-cache-manager: KV cache utilization metrics (via Gateway API Extension)
    • llm-d-inference-scheduler: Scheduling latency and backend pod metrics
    • llm-d-routing-sidecar: No Prometheus metrics implementation
    • gateway-api-inference-extension: Comprehensive inference and pool metrics
    • vllm (v1): Detailed inference performance and resource metrics
  • Special features: Speculative decoding, LoRA adapters, prefix caching
  • Framework: Prometheus with Ray metrics support and multiprocessing capabilities