llm-d Metrics Documentation

This document provides an overview of all metrics generated by the llm-d components.

Overview

The llm-d system uses Prometheus as the primary metrics collection framework, with metrics covering inference performance, resource utilization, error rates, and energy consumption across multiple components.
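
Because every metric below is exposed in the standard Prometheus text format, a component can be inspected directly from its /metrics endpoint before any scrape configuration exists. The following is a minimal sketch, assuming a hypothetical endpoint at http://localhost:8000/metrics and the requests and prometheus_client Python packages; it simply lists the metric families whose names match the prefixes documented here.

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Assumed endpoint; substitute the actual host/port of the component being inspected.
METRICS_URL = "http://localhost:8000/metrics"

text = requests.get(METRICS_URL, timeout=5).text

# Print every exposed metric family whose name uses one of the
# prefixes documented in this overview.
prefixes = ("inference_", "bbr_", "vllm:")
for family in text_string_to_metric_families(text):
    if family.name.startswith(prefixes):
        print(family.type, family.name, "-", family.documentation)
```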

Component Metrics

1. llm-d KV Cache Manager

Status: No Prometheus metrics currently implemented

The KV Cache Manager component does not currently expose Prometheus metrics directly. However, KV cache-related metrics are available through the Gateway API Inference Extension.

Metrics Location: All KV cache-related metrics are defined in gateway-api-inference-extension/pkg/epp/metrics/metrics.go

KV Cache Utilization Metrics

These metrics are exposed through the Gateway API Inference Extension but relate to KV cache functionality:

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_pool_average_kv_cache_utilization | Gauge | Average KV cache utilization per pool | name |
| inference_extension_prefix_indexer_size | Gauge | Size of the prefix indexer | - |
| inference_extension_prefix_indexer_hit_ratio | Histogram | Cache hit ratio distribution | - |
| inference_extension_prefix_indexer_hit_bytes | Histogram | Cache hit length distribution | - |
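
As a quick sanity check, the pool-level utilization gauge can be read from a Prometheus server with an instant query. This is a sketch only; the Prometheus address is an assumption.

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# Current average KV cache utilization per inference pool, highest first
# (the `name` label identifies the pool).
query = "sort_desc(inference_pool_average_kv_cache_utilization)"
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()

for series in result["data"]["result"]:
    print(series["metric"].get("name"), series["value"][1])
```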

2. llm-d Inference Scheduler

Metrics Location: All inference scheduler metrics are defined in gateway-api-inference-extension/pkg/epp/metrics/metrics.go

The Inference Scheduler provides scheduling and performance metrics:

Scheduler Performance Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_extension_scheduler_e2e_duration_seconds | Histogram | End-to-end scheduling latency | - |
| inference_extension_scheduler_plugin_duration_seconds | Histogram | Plugin processing latency | plugin_type, plugin_name |
| inference_extension_request_control_plugin_duration_seconds | Histogram | RequestControl plugin latency | plugin_type, plugin_name |
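
A typical use of these histograms is deriving latency quantiles from their buckets. The sketch below, assuming a Prometheus server at a hypothetical localhost:9090, computes the p95 end-to-end scheduling latency over the last five minutes.

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# p95 end-to-end scheduling latency over the last 5 minutes, computed
# from the buckets of the scheduler e2e duration histogram.
query = (
    "histogram_quantile(0.95, sum by (le) ("
    "rate(inference_extension_scheduler_e2e_duration_seconds_bucket[5m])))"
)
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()
print(result["data"]["result"])
```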

Scheduler Backend Metrics

The scheduler scrapes metrics from individual inference server pods:

| Metric Category | Description |
|-----------------|-------------|
| Queue Size Metrics | Number of requests waiting in queue |
| KV Cache Utilization | Percentage of KV cache being used |
| LoRA Adapter Metrics | Running and waiting LoRA adapters |
| Maximum Active Models | Capacity information |

3. llm-d Routing Sidecar

Status: No Prometheus metrics currently implemented

The routing sidecar component does not currently expose any Prometheus metrics. Adding a Prometheus client library would allow it to report metrics for routing operations such as the following (a sketch appears after this list):

  • Request routing latency
  • Routing success/failure rates
  • Target selection metrics
  • Connection pool utilization
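
As an illustration only, the sketch below shows what such instrumentation could look like. It uses the Python prometheus_client to keep the example compact, even though the sidecar itself is a Go component (client_golang would be the natural choice there); every metric name in it is hypothetical and none exists in the sidecar today.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names -- none of these exist in the sidecar today.
ROUTING_LATENCY = Histogram(
    "routing_sidecar_request_routing_duration_seconds",
    "Time spent selecting a target and forwarding the request",
)
ROUTING_RESULTS = Counter(
    "routing_sidecar_routing_results_total",
    "Routing outcomes by status",
    ["status"],  # e.g. success / failure
)

def route(request):
    # Placeholder for the sidecar's real target-selection and proxying logic.
    with ROUTING_LATENCY.time():
        time.sleep(random.uniform(0.001, 0.01))  # stand-in for forwarding work
        ok = random.random() > 0.05
    ROUTING_RESULTS.labels(status="success" if ok else "failure").inc()

if __name__ == "__main__":
    start_http_server(8001)  # exposes /metrics on :8001
    while True:
        route(None)
        time.sleep(0.1)
```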

4. Gateway API Inference Extension

Metrics Location: The Gateway API Inference Extension metrics are defined in three files:

  • Main metrics: gateway-api-inference-extension/pkg/epp/metrics/metrics.go (most metrics)
  • Per-pod queue metrics: gateway-api-inference-extension/pkg/epp/metrics/collectors/inference_pool.go
  • BBR metrics: gateway-api-inference-extension/pkg/bbr/metrics/metrics.go

Inference Model Metrics (Subsystem: "inference_model")

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_model_request_total | Counter | Total inference model requests | model_name, target_model_name |
| inference_model_request_error_total | Counter | Total inference model request errors | model_name, target_model_name, error_code |
| inference_model_request_duration_seconds | Histogram | Request latency distribution | model_name, target_model_name |
| inference_model_request_sizes | Histogram | Request size distribution (bytes) | model_name, target_model_name |
| inference_model_response_sizes | Histogram | Response size distribution (bytes) | model_name, target_model_name |
| inference_model_input_tokens | Histogram | Input token count distribution | model_name, target_model_name |
| inference_model_output_tokens | Histogram | Output token count distribution | model_name, target_model_name |
| inference_model_normalized_time_per_output_token_seconds | Histogram | Latency per output token | model_name, target_model_name |
| inference_model_running_requests | Gauge | Number of running requests | model_name |
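
Because the request and error counters carry the same model_name label, a per-model error ratio falls out of a simple PromQL division. A sketch, assuming a Prometheus server at a hypothetical localhost:9090:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# Per-model error ratio over the last 5 minutes: errors / total requests,
# grouped by the shared model_name label.
query = (
    "sum by (model_name) (rate(inference_model_request_error_total[5m])) "
    "/ sum by (model_name) (rate(inference_model_request_total[5m]))"
)
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()

for series in result["data"]["result"]:
    print(series["metric"]["model_name"], series["value"][1])
```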

Inference Pool Metrics (Subsystem: "inference_pool")

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_pool_average_kv_cache_utilization | Gauge | Average KV cache utilization | name |
| inference_pool_average_queue_size | Gauge | Average pending requests in queue | name |
| inference_pool_ready_pods | Gauge | Number of ready pods | name |
| inference_pool_per_pod_queue_size | Gauge | Queue size per pod | name, model_server_pod |

Scheduler Metrics (Subsystem: "inference_extension")

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_extension_scheduler_e2e_duration_seconds | Histogram | End-to-end scheduling latency | - |
| inference_extension_scheduler_plugin_duration_seconds | Histogram | Plugin processing latency | plugin_type, plugin_name |
| inference_extension_request_control_plugin_duration_seconds | Histogram | RequestControl plugin latency | plugin_type, plugin_name |

Prefix Cache Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_extension_prefix_indexer_size | Gauge | Prefix indexer size | - |
| inference_extension_prefix_indexer_hit_ratio | Histogram | Cache hit ratio | - |
| inference_extension_prefix_indexer_hit_bytes | Histogram | Cache hit bytes | - |

Info Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| inference_extension_info | Gauge | Build information | commit, build_ref |

BBR Component Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| bbr_success_total | Counter | Successful model name injections | - |
| bbr_model_not_in_body_total | Counter | Model not present in request body | - |
| bbr_model_not_parsed_total | Counter | Model parsing failures | - |

5. vLLM Engine Metrics (v1)

Metrics Location: vLLM v1 metrics are defined in multiple files under the vllm/vllm/v1/ directory:

  • Main metrics: vllm/vllm/v1/metrics/loggers.py (most metrics)
  • Speculative decoding: vllm/vllm/v1/spec_decode/metrics.py

Note: llm-d uses vLLM v1, whose metric names differ from those of earlier vLLM releases. Key differences (illustrated in the query sketch after this list):

  • Counter metrics drop the _total suffix (e.g., vllm:prompt_tokens vs vllm:prompt_tokens_total)
  • Prefix cache metrics are Counters instead of Histograms
  • Enhanced modular architecture with dedicated metrics modules
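
The naming difference mainly affects PromQL: rates are taken over the suffix-less counters. A sketch of token-throughput queries, assuming a Prometheus server at a hypothetical localhost:9090 (adjust the names if your exposition format re-appends _total):

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# Prompt and generation token throughput (tokens/second) over 5 minutes,
# using the v1 counter names listed in the tables below.
queries = {
    "prompt_tokens_per_s": "sum(rate(vllm:prompt_tokens[5m]))",
    "generation_tokens_per_s": "sum(rate(vllm:generation_tokens[5m]))",
}
for label, query in queries.items():
    result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()
    print(label, result["data"]["result"])
```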

System State Gauges

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:num_requests_running | Gauge | Requests currently running on GPU | - |
| vllm:num_requests_waiting | Gauge | Requests waiting to be processed | - |
| vllm:gpu_cache_usage_perc | Gauge | GPU KV-cache usage percentage | - |
| vllm:kv_cache_usage_perc | Gauge | KV-cache usage percentage (V1) | - |
| vllm:lora_requests_info | Gauge | LoRA requests statistics | running_lora_adapters, max_lora, waiting_lora_adapters |

Iteration Counters

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:num_preemptions | Counter | Total preemptions from engine | - |
| vllm:prompt_tokens | Counter | Total prefill tokens processed | - |
| vllm:generation_tokens | Counter | Total generation tokens processed | - |

Iteration Histograms

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:iteration_tokens_total | Histogram | Tokens per engine step | - |
| vllm:time_to_first_token_seconds | Histogram | Time to first token | - |
| vllm:time_per_output_token_seconds | Histogram | Time per output token | - |
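
Time-to-first-token is usually reported as a quantile derived from these histogram buckets. A sketch, again assuming a Prometheus server at a hypothetical localhost:9090:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# p99 time-to-first-token over the last 5 minutes, from histogram buckets.
query = (
    "histogram_quantile(0.99, sum by (le) ("
    "rate(vllm:time_to_first_token_seconds_bucket[5m])))"
)
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()
print(result["data"]["result"])
```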

Request Latency Histograms

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:e2e_request_latency_seconds | Histogram | End-to-end request latency | - |
| vllm:request_queue_time_seconds | Histogram | Time in WAITING phase | - |
| vllm:request_inference_time_seconds | Histogram | Time in RUNNING phase | - |
| vllm:request_prefill_time_seconds | Histogram | Time in PREFILL phase | - |
| vllm:request_decode_time_seconds | Histogram | Time in DECODE phase | - |

Request Metadata Histograms

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:request_prompt_tokens | Histogram | Prefill tokens per request | - |
| vllm:request_generation_tokens | Histogram | Generation tokens per request | - |
| vllm:request_max_num_generation_tokens | Histogram | Maximum requested generation tokens | - |
| vllm:request_params_n | Histogram | Request parameter n | - |
| vllm:request_params_max_tokens | Histogram | Request parameter max_tokens | - |

Request Success Counter

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:request_success | Counter | Successfully processed requests | finished_reason |

Speculative Decoding Metrics

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:spec_decode_num_drafts | Counter | Number of drafts | - |
| vllm:spec_decode_num_draft_tokens | Counter | Draft tokens | - |
| vllm:spec_decode_num_accepted_tokens | Counter | Accepted tokens | - |
| vllm:spec_decode_num_accepted_tokens_per_pos | Counter | Accepted tokens per position | position |
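
A commonly derived quantity is the draft-token acceptance rate: accepted draft tokens divided by proposed draft tokens. A sketch, assuming a Prometheus server at a hypothetical localhost:9090:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# Speculative-decoding acceptance rate over the last 5 minutes.
query = (
    "sum(rate(vllm:spec_decode_num_accepted_tokens[5m])) "
    "/ sum(rate(vllm:spec_decode_num_draft_tokens[5m]))"
)
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()
print(result["data"]["result"])
```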

Prefix Cache Metrics (V1)

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| vllm:prefix_cache_queries | Counter | Prefix cache queries (tokens) | - |
| vllm:prefix_cache_hits | Counter | Prefix cache hits (tokens) | - |
| vllm:cache_config_info | Gauge | Cache configuration info | - |
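
Since both prefix-cache metrics are counters in v1, the hit ratio is computed from their rates rather than read from a histogram. A sketch, assuming a Prometheus server at a hypothetical localhost:9090:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address

# Token-level prefix cache hit ratio over the last 5 minutes:
# hit tokens divided by queried tokens.
query = (
    "sum(rate(vllm:prefix_cache_hits[5m])) "
    "/ sum(rate(vllm:prefix_cache_queries[5m]))"
)
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}).json()
print(result["data"]["result"])
```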

Summary

The llm-d metrics system provides observability across the core components:

  • 40+ distinct metrics covering inference performance, resource utilization, and system health
  • Metric types: Counters, Gauges, Histograms, and Info metrics
  • Core LLM-D Components:
    • llm-d-kv-cache-manager: KV cache utilization metrics (via Gateway API Extension)
    • llm-d-inference-scheduler: Scheduling latency and backend pod metrics
    • llm-d-routing-sidecar: No Prometheus metrics implementation
    • gateway-api-inference-extension: Comprehensive inference and pool metrics
    • vllm (v1): Detailed inference performance and resource metrics
  • Special features: Speculative decoding, LoRA adapters, prefix caching
  • Framework: Prometheus with Ray metrics support and multiprocessing capabilities