@nerdalert
Created May 31, 2025 02:14
$ ENV_METADATA_GPU="4xNVIDIA_L40S" \
./e2e-bench-control.sh --4xgpu-minikube --model meta-llama/Llama-3.2-3B-Instruct

🌟 LLM Deployment and Benchmark Orchestrator 🌟
-------------------------------------------------
--- Configuration Summary ---
Minikube Start Args (Hardcoded): --driver docker --container-runtime docker --gpus all --memory no-limit --cpus no-limit
LLMD Installer Script (Hardcoded): ./llmd-installer.sh
Test Request Script (Hardcoded): ./test-request.sh (Args: --minikube, Retry: 30s)
Run Bench Script (Hardcoded): ./run-bench.sh

Benchmark model for metadata will be derived from the --model flag or dynamically from each values file.
Benchmark Metadata GPU (Configurable, consistent for all runs): 4xNVIDIA_L40S
Benchmark Metadata Gateway (Configurable): kgateway
Result File (Configurable): results.json
YAML Modifications for --4xgpu-minikube: ENABLED
YAML Model Override (from --model flag): ENABLED (Model: meta-llama/Llama-3.2-3B-Instruct)
Deployments to process: 3
  - Values: examples/no-features/no-features.yaml
  - Values: examples/base/base.yaml
  - Values: examples/kvcache/kvcache.yaml
-----------------------------
πŸ› οΈ Applying --4xgpu-minikube modifications to YAML files...
Processing examples/no-features/no-features.yaml for --4xgpu-minikube potential modification...
  Using default replica counts for examples/no-features/no-features.yaml.
  Successfully updated replicas and nodeSelectors in examples/no-features/no-features.yaml.
Processing examples/base/base.yaml for --4xgpu-minikube potential modification...
  Using default replica counts for examples/base/base.yaml.
  Successfully updated replicas and nodeSelectors in examples/base/base.yaml.
Processing examples/kvcache/kvcache.yaml for --4xgpu-minikube potential modification...
  Using default replica counts for examples/kvcache/kvcache.yaml.
  Successfully updated replicas and nodeSelectors in examples/kvcache/kvcache.yaml.
πŸ› οΈ --4xgpu-minikube modifications complete.
-------------------------------------
πŸ› οΈ Applying --model 'meta-llama/Llama-3.2-3B-Instruct' modifications to YAML files...
  Processing examples/no-features/no-features.yaml for model override...
    Successfully updated model in examples/no-features/no-features.yaml.
  Processing examples/base/base.yaml for model override...
    Successfully updated model in examples/base/base.yaml.
  Processing examples/kvcache/kvcache.yaml for model override...
    Successfully updated model in examples/kvcache/kvcache.yaml.
πŸ› οΈ Model override modifications complete.
-------------------------------------
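
For reference, the two rewrite passes above can be approximated by hand with `yq` (v4). This is a rough sketch only; the key paths below are assumptions about the example values schema, not taken from this log, so check the real files before reusing it:

```bash
# Hedged sketch of the orchestrator's YAML edits (hypothetical paths).
export MODEL="meta-llama/Llama-3.2-3B-Instruct"
for f in examples/no-features/no-features.yaml \
         examples/base/base.yaml \
         examples/kvcache/kvcache.yaml; do
  yq -i '.sampleApplication.model = strenv(MODEL)' "$f"   # --model override (assumed path)
  yq -i '.sampleApplication.decode.replicas = 4' "$f"     # fit the 4 local GPUs (assumed path)
  # nodeSelector pinning would be edited the same way (path omitted; schema unknown).
done
```
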
========= Starting Full Deployment and Benchmark Process =========
--- Minikube Setup ---
Attempting to delete any existing Minikube instance if one exists...
Starting Minikube with: --driver docker --container-runtime docker --gpus all --memory no-limit --cpus no-limit
πŸš€ EXEC: minikube start --driver docker --container-runtime docker --gpus all --memory no-limit --cpus no-limit
πŸ˜„  minikube v1.35.0 on Ubuntu 24.04
✨  Using the docker driver based on user configuration
πŸ“Œ  Using Docker driver with root privileges
πŸ‘  Starting "minikube" primary control-plane node in "minikube" cluster
🚜  Pulling base image v0.0.46 ...
πŸ”₯  Creating docker container (CPUs=no-limit, Memory=no-limit) ...
🐳  Preparing Kubernetes v1.32.0 on Docker 27.4.1 ...
    β–ͺ Generating certificates and keys ...
    β–ͺ Booting up control plane ...
    β–ͺ Configuring RBAC rules ...
πŸ”—  Configuring bridge CNI (Container Networking Interface) ...
πŸ”Ž  Verifying Kubernetes components...
    β–ͺ Using image gcr.io/k8s-minikube/storage-provisioner:v5
    β–ͺ Using image nvcr.io/nvidia/k8s-device-plugin:v0.17.0
🌟  Enabled addons: nvidia-device-plugin, default-storageclass
πŸ„  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
Minikube started. Waiting for 10 seconds for stabilization...
Minikube setup complete.
-------------------------
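
Before moving on, it is worth confirming that the device plugin actually exposed the GPUs to the cluster. These are standard kubectl checks, not part of the orchestrator:

```bash
# Allocatable GPUs on the minikube node (expect "4" for 4xNVIDIA_L40S):
kubectl get node minikube -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'; echo
# The nvidia-device-plugin pod should be Running:
kubectl get pods -A | grep -i nvidia
```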

Processing Deployment 1/3: no-features
=================================================================
--- Installing LLM Deployment: no-features (using examples/no-features/no-features.yaml) ---
πŸš€ EXEC: ./llmd-installer.sh --minikube --values-file examples/no-features/no-features.yaml --disable-metrics-collection
ℹ️  πŸ“‚ Setting up script environment...
ℹ️  kubectl can reach a running Kubernetes cluster.
βœ… HF_TOKEN validated
ℹ️  πŸ—οΈ Installing GAIE Kubernetes infrastructure…
βœ… πŸ“œ Base CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/gatewayclasses.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/gateways.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/grpcroutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/httproutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/referencegrants.gateway.networking.k8s.io created
βœ… πŸšͺ GAIE CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/inferencemodels.inference.networking.x-k8s.io created
customresourcedefinition.apiextensions.k8s.io/inferencepools.inference.networking.x-k8s.io created
βœ… πŸŽ’ Gateway provider 'kgateway': Installing...
Release "kgateway-crds" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
NAME: kgateway-crds
LAST DEPLOYED: Fri May 30 19:32:41 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Release "kgateway" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway:v2.0.0
Digest: sha256:bbd7559eaa05ef6c27382390768889f5475e75bdcb4bd81ebd0f770cd14ab7a8
NAME: kgateway
LAST DEPLOYED: Fri May 30 19:32:42 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
βœ… GAIE infra applied
ℹ️  πŸ“¦ Creating namespace llm-d...
namespace/llm-d created
Context "minikube" modified.
βœ… Namespace ready
ℹ️  πŸ”Ή Using merged values: /tmp/tmp.BJddhCr3Ka
ℹ️  πŸ” Creating/updating HF token secret...
secret/llm-d-hf-token created
βœ… HF token secret created
ℹ️  Fetching OCP proxy UID...
ℹ️  No OpenShift SCC annotation found; defaulting PROXY_UID=0
ℹ️  πŸ“œ Applying modelservice CRD...
customresourcedefinition.apiextensions.k8s.io/modelservices.llm-d.ai created
βœ… ModelService CRD applied
ℹ️  ⏭️ Model download to PVC skipped: BYO model via HF repo_id selected.
protocol hf chosen - models will be downloaded JIT in inferencing pods.
"bitnami" already exists with the same configuration, skipping
ℹ️  πŸ› οΈ Building Helm chart dependencies...
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading common from repo https://charts.bitnami.com/bitnami
Downloading redis from repo https://charts.bitnami.com/bitnami
Pulled: registry-1.docker.io/bitnamicharts/redis:20.13.4
Digest: sha256:6a389e13237e8e639ec0d445e785aa246b57bfce711b087033a196a291d5c8d7
Deleting outdated charts
βœ… Dependencies built
ℹ️  Metrics collection disabled by user request
ℹ️  Metrics collection disabled by user request
ℹ️  Metrics collection disabled by user request
ℹ️  🚚 Deploying llm-d chart with /tmp/tmp.BJddhCr3Ka...
Release "llm-d" does not exist. Installing it now.
NAME: llm-d
LAST DEPLOYED: Fri May 30 19:32:50 2025
NAMESPACE: llm-d
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing llm-d.

Your release is named `llm-d`.

To learn more about the release, try:

```bash
$ helm status llm-d
$ helm get all llm-d
```

Following presets are available to your users:

Name                                         Description
basic-gpu-preset                             Basic gpu inference
basic-gpu-with-nixl-preset                   GPU inference with NIXL P/D KV transfer and cache offloading
basic-gpu-with-nixl-and-redis-lookup-preset  GPU inference with NIXL P/D KV transfer, cache offloading and Redis lookup server
basic-sim-preset                             Basic simulation
βœ… llm-d deployed
βœ… πŸŽ‰ Installation complete.
Installation command for no-features sent.

--- Waiting for vLLM instance for deployment 'no-features' to initialize ---
ℹ️ This step can take some time, as it may involve downloading large model files and then initializing the vLLM engine.
Attempt 1/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 2/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 3/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 4/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 5/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 6/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 7/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 8/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 9/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
βœ… vLLM instance for deployment 'no-features' is ready!
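
The orchestrator polls on a fixed 30-second interval, up to 60 attempts. A hand-rolled equivalent with `kubectl wait` might look like the sketch below; the label selector is an assumption, so inspect `kubectl get pods -n llm-d --show-labels` for the real one:

```bash
# Minimal readiness loop (assumed label selector; adjust as needed).
for i in $(seq 1 60); do
  if kubectl wait pod -n llm-d -l app.kubernetes.io/instance=llm-d \
       --for=condition=Ready --timeout=5s >/dev/null 2>&1; then
    echo "vLLM pods ready"; break
  fi
  echo "Attempt $i/60: not ready yet, retrying in 30s"; sleep 30
done
```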

--- Running Benchmark for: no-features (Model: meta-llama/Llama-3.2-3B-Instruct, Prefill: 0, Decode: 4, Input: 1000, Output: 500) ---
Metadata: deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Result File: results.json
πŸš€ EXEC: ./run-bench.sh --model meta-llama/Llama-3.2-3B-Instruct --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --dataset-name random --input-len 1000 --output-len 500 --request-rates 10,30,inf --metadata deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500 --result-file results.json
secret/hf-token-secret created
▢️ Benchmarking MODEL=meta-llama/Llama-3.2-3B-Instruct at rates: 10 30 inf QPS for 30 seconds each
πŸ”– Results will go into ./results.json
πŸš€ Launching vllm-bench-job-10qps (QPS=10, prompts=300)…
job.batch/vllm-bench-job-10qps created
job.batch/vllm-bench-job-10qps condition met
πŸ“– Logs from vllm-bench-job-10qps:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 19:46:38 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=10
NUM_PROMPTS=300
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500

Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 10 --num-prompts 300 --save-result --result-filename results.json --metadata deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 19:46:47 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 19:46:48 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=300, logprobs=None, request_rate=10.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=no-features', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 300/300 [00:35<00:00, 8.37it/s]
============ Serving Benchmark Result ============
Successful requests:                     300
Benchmark duration (s):                  35.84
Total input tokens:                      299700
Total generated tokens:                  80945
Request throughput (req/s):              8.37
Output token throughput (tok/s):         2258.21
Total Token throughput (tok/s):          10619.28
---------------Time to First Token----------------
Mean TTFT (ms):                          54.60
Median TTFT (ms):                        52.87
P99 TTFT (ms):                           98.70
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.71
Median TPOT (ms):                        13.69
P99 TPOT (ms):                           15.79
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.73
Median ITL (ms):                         13.10
P99 ITL (ms):                            34.49

<<<RESULT_START>>> {"date": "20250530-194738", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "no-features", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 35.84472489100017, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 80945, "request_throughput": 8.369432347779671, "request_goodput:": null, "output_throughput": 2258.212337970085, "total_token_throughput": 10619.275253401978, "mean_ttft_ms": 54.59561144333596, "median_ttft_ms": 52.868896000063614, "std_ttft_ms": 10.45418108071603, "p99_ttft_ms": 98.69780922984769, "mean_tpot_ms": 13.710971568239172, "median_tpot_ms": 13.686303201518745, "std_tpot_ms": 0.9966270621573491, "p99_tpot_ms": 15.793261194984588, "mean_itl_ms": 13.73187330695019, "median_itl_ms": 13.098381999952835, "std_itl_ms": 4.023339545674027, "p99_itl_ms": 34.48662216016601} <<<RESULT_END>>> Appended results block for 10 QPS Cleaning up Job vllm-bench-job-10qps... job.batch "vllm-bench-job-10qps" deleted πŸš€ Launching vllm-bench-job-30qps (QPS=30, prompts=900)… job.batch/vllm-bench-job-30qps created job.batch/vllm-bench-job-30qps condition met πŸ“– Logs from vllm-bench-job-30qps: Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN Starting benchmark at Fri May 30 19:47:42 UTC 2025 ----- ENV VARS ----- BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 MODEL=meta-llama/Llama-3.2-3B-Instruct DATASET_NAME=random RANDOM_INPUT_LEN=1000 RANDOM_OUTPUT_LEN=500 REQUEST_RATE=30 NUM_PROMPTS=900 IGNORE_EOS=true RESULT_FILENAME=results.json METADATA=deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
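
The `<<<RESULT_START>>>`/`<<<RESULT_END>>>` sentinels make the JSON blocks easy to recover even from a raw log capture. Something along these lines should work (`bench.log` is a hypothetical saved copy of the job logs):

```bash
# Extract every result object, one JSON document per line:
sed -n 's/.*<<<RESULT_START>>> \(.*\) <<<RESULT_END>>>.*/\1/p' bench.log > results.json
```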

Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 30 --num-prompts 900 --save-result --result-filename results.json --metadata deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 19:47:51 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 19:47:52 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=30.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=no-features', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 30.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 900/900 [00:40<00:00, 22.10it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  40.72
Total input tokens:                      899100
Total generated tokens:                  237873
Request throughput (req/s):              22.10
Output token throughput (tok/s):         5841.36
Total Token throughput (tok/s):          27920.21
---------------Time to First Token----------------
Mean TTFT (ms):                          80.19
Median TTFT (ms):                        76.65
P99 TTFT (ms):                           171.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.94
Median TPOT (ms):                        25.47
P99 TPOT (ms):                           35.58
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.81
Median ITL (ms):                         21.83
P99 ITL (ms):                            74.24

<<<RESULT_START>>> {"date": "20250530-194848", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "no-features", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.722225800999695, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 237873, "request_throughput": 22.100953037245468, "request_goodput:": null, "output_throughput": 5841.355557587435, "total_token_throughput": 27920.20764179566, "mean_ttft_ms": 80.19281090889662, "median_ttft_ms": 76.65224249990388, "std_ttft_ms": 26.85547410843286, "p99_ttft_ms": 171.67865642984907, "mean_tpot_ms": 24.942996925599196, "median_tpot_ms": 25.466091678535058, "std_tpot_ms": 5.279276119470118, "p99_tpot_ms": 35.581390495641394, "mean_itl_ms": 24.814122015140974, "median_itl_ms": 21.825537999575317, "std_itl_ms": 11.660791168594844, "p99_itl_ms": 74.24358160013071} <<<RESULT_END>>> Appended results block for 30 QPS Cleaning up Job vllm-bench-job-30qps... job.batch "vllm-bench-job-30qps" deleted πŸš€ Launching vllm-bench-job-inf (infinite QPS, prompts=900)… job.batch/vllm-bench-job-inf created job.batch/vllm-bench-job-inf condition met πŸ“– Logs from vllm-bench-job-inf: Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN Starting benchmark at Fri May 30 19:48:52 UTC 2025 ----- ENV VARS ----- BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 MODEL=meta-llama/Llama-3.2-3B-Instruct DATASET_NAME=random RANDOM_INPUT_LEN=1000 RANDOM_OUTPUT_LEN=500 REQUEST_RATE=inf NUM_PROMPTS=900 IGNORE_EOS=true RESULT_FILENAME=results.json METADATA=deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500

Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate inf --num-prompts 900 --save-result --result-filename results.json --metadata deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 19:49:02 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 19:49:03 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=no-features', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 900/900 [00:28<00:00, 31.18it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  28.86
Total input tokens:                      899100
Total generated tokens:                  240165
Request throughput (req/s):              31.18
Output token throughput (tok/s):         8321.23
Total Token throughput (tok/s):          39473.22
---------------Time to First Token----------------
Mean TTFT (ms):                          4386.74
Median TTFT (ms):                        4066.16
P99 TTFT (ms):                           9951.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          61.81
Median TPOT (ms):                        53.14
P99 TPOT (ms):                           118.56
---------------Inter-token Latency----------------
Mean ITL (ms):                           48.17
Median ITL (ms):                         37.26
P99 ITL (ms):                            134.97

<<<RESULT_START>>> {"date": "20250530-194947", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "no-features", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 28.861716763000004, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240165, "request_throughput": 31.183176225808488, "request_goodput:": null, "output_throughput": 8321.230575856995, "total_token_throughput": 39473.22362543967, "mean_ttft_ms": 4386.739448253334, "median_ttft_ms": 4066.1594384998807, "std_ttft_ms": 2469.091218460141, "p99_ttft_ms": 9951.535929819987, "mean_tpot_ms": 61.81461881712581, "median_tpot_ms": 53.13592476507745, "std_tpot_ms": 22.250639166144722, "p99_tpot_ms": 118.56057960224177, "mean_itl_ms": 48.16794500382424, "median_itl_ms": 37.264703000346344, "std_itl_ms": 26.871669857570474, "p99_itl_ms": 134.96996935997225} <<<RESULT_END>>> Appended results block for infinite QPS Cleaning up Job vllm-bench-job-inf... job.batch "vllm-bench-job-inf" deleted βœ… All benchmarks complete. Combined results in ./results.json Benchmark for no-features completed.
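
Since results.json accumulates one concatenated JSON object per run, `jq -s` can slurp them into an array for a quick cross-deployment comparison once all three cycles finish. A sketch, assuming the field names shown in the result blocks above:

```bash
# Columns: deployment, request rate, req/s, mean TTFT (ms), mean TPOT (ms).
jq -s -r '.[] | [.deployment, (.request_rate|tostring),
                 .request_throughput, .mean_ttft_ms, .mean_tpot_ms] | @tsv' results.json
```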

--- Uninstalling LLM Deployment: no-features ---
πŸš€ EXEC: ./llmd-installer.sh --minikube --uninstall
ℹ️  πŸ“‚ Setting up script environment...
ℹ️  kubectl can reach a running Kubernetes cluster.
ℹ️  πŸ—‘οΈ Tearing down GAIE Kubernetes infrastructure…
βœ… πŸ“œ Base CRDs: Deleting...
customresourcedefinition.apiextensions.k8s.io "gatewayclasses.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "gateways.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "grpcroutes.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "httproutes.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "referencegrants.gateway.networking.k8s.io" deleted
βœ… πŸšͺ GAIE CRDs: Deleting...
customresourcedefinition.apiextensions.k8s.io "inferencemodels.inference.networking.x-k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "inferencepools.inference.networking.x-k8s.io" deleted
βœ… πŸŽ’ Gateway provider 'kgateway': Deleting...
release "kgateway" uninstalled
release "kgateway-crds" uninstalled
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
ℹ️  πŸ—‘οΈ Uninstalling llm-d chart...
release "llm-d" uninstalled
ℹ️  πŸ—‘οΈ Deleting namespace llm-d...
namespace "llm-d" deleted
ℹ️  πŸ—‘οΈ Deleting monitoring namespace...
ℹ️  πŸ—‘οΈ Deleting Minikube hostPath PV (model-hostpath-pv)...
ℹ️  πŸ—‘οΈ Deleting ClusterRoleBinding llm-d
No resources found
βœ… πŸ’€ Uninstallation complete
Uninstallation for no-features completed.

Completed cycle for no-features.

Pausing for a few seconds before next deployment...

Processing Deployment 2/3: base

--- Installing LLM Deployment: base (using examples/base/base.yaml) ---
πŸš€ EXEC: ./llmd-installer.sh --minikube --values-file examples/base/base.yaml --disable-metrics-collection
ℹ️  πŸ“‚ Setting up script environment...
ℹ️  kubectl can reach a running Kubernetes cluster.
βœ… HF_TOKEN validated
ℹ️  πŸ—οΈ Installing GAIE Kubernetes infrastructure…
βœ… πŸ“œ Base CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/gatewayclasses.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/gateways.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/grpcroutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/httproutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/referencegrants.gateway.networking.k8s.io created
βœ… πŸšͺ GAIE CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/inferencemodels.inference.networking.x-k8s.io created
customresourcedefinition.apiextensions.k8s.io/inferencepools.inference.networking.x-k8s.io created
βœ… πŸŽ’ Gateway provider 'kgateway': Installing...
Release "kgateway-crds" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
NAME: kgateway-crds
LAST DEPLOYED: Fri May 30 19:50:09 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Release "kgateway" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway:v2.0.0
Digest: sha256:bbd7559eaa05ef6c27382390768889f5475e75bdcb4bd81ebd0f770cd14ab7a8
NAME: kgateway
LAST DEPLOYED: Fri May 30 19:50:10 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
βœ… GAIE infra applied
ℹ️  πŸ“¦ Creating namespace llm-d...
namespace/llm-d created
Context "minikube" modified.
βœ… Namespace ready
ℹ️  πŸ”Ή Using merged values: /tmp/tmp.ECcof8e3Jm
ℹ️  πŸ” Creating/updating HF token secret...
secret/llm-d-hf-token created
βœ… HF token secret created
ℹ️  Fetching OCP proxy UID...
ℹ️  No OpenShift SCC annotation found; defaulting PROXY_UID=0
ℹ️  πŸ“œ Applying modelservice CRD...
customresourcedefinition.apiextensions.k8s.io/modelservices.llm-d.ai unchanged
βœ… ModelService CRD applied
ℹ️  ⏭️ Model download to PVC skipped: BYO model via HF repo_id selected.
protocol hf chosen - models will be downloaded JIT in inferencing pods.
"bitnami" already exists with the same configuration, skipping
ℹ️  πŸ› οΈ Building Helm chart dependencies...
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading common from repo https://charts.bitnami.com/bitnami
Downloading redis from repo https://charts.bitnami.com/bitnami
Pulled: registry-1.docker.io/bitnamicharts/redis:20.13.4
Digest: sha256:6a389e13237e8e639ec0d445e785aa246b57bfce711b087033a196a291d5c8d7
Deleting outdated charts
βœ… Dependencies built
ℹ️  Metrics collection disabled by user request
ℹ️  Metrics collection disabled by user request
ℹ️  Metrics collection disabled by user request
ℹ️  🚚 Deploying llm-d chart with /tmp/tmp.ECcof8e3Jm...
Release "llm-d" does not exist. Installing it now.
NAME: llm-d
LAST DEPLOYED: Fri May 30 19:50:18 2025
NAMESPACE: llm-d
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing llm-d.

Your release is named `llm-d`.

To learn more about the release, try:

```bash
$ helm status llm-d
$ helm get all llm-d
```

Following presets are available to your users:

Name                                         Description
basic-gpu-preset                             Basic gpu inference
basic-gpu-with-nixl-preset                   GPU inference with NIXL P/D KV transfer and cache offloading
basic-gpu-with-nixl-and-redis-lookup-preset  GPU inference with NIXL P/D KV transfer, cache offloading and Redis lookup server
basic-sim-preset                             Basic simulation
βœ… llm-d deployed
βœ… πŸŽ‰ Installation complete.
Installation command for base sent.

--- Waiting for vLLM instance for deployment 'base' to initialize ---
ℹ️ This step can take some time, as it may involve downloading large model files and then initializing the vLLM engine.
Attempt 1/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
Attempt 2/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
Attempt 3/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
Attempt 4/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
Attempt 5/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
Attempt 6/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
Attempt 7/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
βœ… vLLM instance for deployment 'base' is ready!

--- Running Benchmark for: base (Model: meta-llama/Llama-3.2-3B-Instruct, Prefill: 0, Decode: 4, Input: 1000, Output: 500) ---
Metadata: deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Result File: results.json
πŸš€ EXEC: ./run-bench.sh --model meta-llama/Llama-3.2-3B-Instruct --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --dataset-name random --input-len 1000 --output-len 500 --request-rates 10,30,inf --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500 --result-file results.json
secret/hf-token-secret created
▢️ Benchmarking MODEL=meta-llama/Llama-3.2-3B-Instruct at rates: 10 30 inf QPS for 30 seconds each
πŸ”– Results will go into ./results.json
πŸš€ Launching vllm-bench-job-10qps (QPS=10, prompts=300)…
job.batch/vllm-bench-job-10qps created
job.batch/vllm-bench-job-10qps condition met
πŸ“– Logs from vllm-bench-job-10qps:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 19:58:22 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=10
NUM_PROMPTS=300
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500

Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 10 --num-prompts 300 --save-result --result-filename results.json --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 19:58:31 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 19:58:32 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=300, logprobs=None, request_rate=10.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=base', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 300/300 [00:36<00:00, 8.29it/s]
============ Serving Benchmark Result ============
Successful requests:                     300
Benchmark duration (s):                  36.20
Total input tokens:                      299700
Total generated tokens:                  82608
Request throughput (req/s):              8.29
Output token throughput (tok/s):         2282.27
Total Token throughput (tok/s):          10562.32
---------------Time to First Token----------------
Mean TTFT (ms):                          54.59
Median TTFT (ms):                        52.79
P99 TTFT (ms):                           89.80
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.81
Median TPOT (ms):                        13.62
P99 TPOT (ms):                           17.14
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.79
Median ITL (ms):                         13.05
P99 ITL (ms):                            34.87

<<<RESULT_START>>> {"date": "20250530-195922", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 36.19546746500009, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 82608, "request_throughput": 8.288330584211705, "request_goodput:": null, "output_throughput": 2282.2747096685353, "total_token_throughput": 10562.316963296029, "mean_ttft_ms": 54.592403276680365, "median_ttft_ms": 52.79431650001243, "std_ttft_ms": 9.218280749895815, "p99_ttft_ms": 89.80040712966002, "mean_tpot_ms": 13.81029661989983, "median_tpot_ms": 13.619587584448205, "std_tpot_ms": 1.1567995163110574, "p99_tpot_ms": 17.1377381639993, "mean_itl_ms": 13.787254072508166, "median_itl_ms": 13.049414999841247, "std_itl_ms": 4.111252531976249, "p99_itl_ms": 34.86666600036186} <<<RESULT_END>>> Appended results block for 10 QPS Cleaning up Job vllm-bench-job-10qps... job.batch "vllm-bench-job-10qps" deleted πŸš€ Launching vllm-bench-job-30qps (QPS=30, prompts=900)… job.batch/vllm-bench-job-30qps created job.batch/vllm-bench-job-30qps condition met πŸ“– Logs from vllm-bench-job-30qps: Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN Starting benchmark at Fri May 30 19:59:26 UTC 2025 ----- ENV VARS ----- BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 MODEL=meta-llama/Llama-3.2-3B-Instruct DATASET_NAME=random RANDOM_INPUT_LEN=1000 RANDOM_OUTPUT_LEN=500 REQUEST_RATE=30 NUM_PROMPTS=900 IGNORE_EOS=true RESULT_FILENAME=results.json METADATA=deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500

Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 30 --num-prompts 900 --save-result --result-filename results.json --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 19:59:35 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 19:59:37 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=30.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=base', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 30.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 900/900 [00:40<00:00, 22.07it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  40.77
Total input tokens:                      899100
Total generated tokens:                  240128
Request throughput (req/s):              22.07
Output token throughput (tok/s):         5889.41
Total Token throughput (tok/s):          27940.84
---------------Time to First Token----------------
Mean TTFT (ms):                          73.28
Median TTFT (ms):                        72.59
P99 TTFT (ms):                           182.99
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.98
Median TPOT (ms):                        24.20
P99 TPOT (ms):                           35.32
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.91
Median ITL (ms):                         21.30
P99 ITL (ms):                            75.28

<<<RESULT_START>>> {"date": "20250530-200032", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.77285659999961, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240128, "request_throughput": 22.073508580215805, "request_goodput:": null, "output_throughput": 5889.408298166733, "total_token_throughput": 27940.843369802322, "mean_ttft_ms": 73.27519322778042, "median_ttft_ms": 72.58796349969998, "std_ttft_ms": 34.88893424361722, "p99_ttft_ms": 182.99465706019868, "mean_tpot_ms": 23.98439489090563, "median_tpot_ms": 24.200584621242403, "std_tpot_ms": 6.2426528186199635, "p99_tpot_ms": 35.31807143475544, "mean_itl_ms": 23.91148948750557, "median_itl_ms": 21.303171499766904, "std_itl_ms": 11.541504488505524, "p99_itl_ms": 75.27688499983014} <<<RESULT_END>>> Appended results block for 30 QPS Cleaning up Job vllm-bench-job-30qps... job.batch "vllm-bench-job-30qps" deleted πŸš€ Launching vllm-bench-job-inf (infinite QPS, prompts=900)… job.batch/vllm-bench-job-inf created job.batch/vllm-bench-job-inf condition met πŸ“– Logs from vllm-bench-job-inf: Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN Starting benchmark at Fri May 30 20:00:36 UTC 2025 ----- ENV VARS ----- BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 MODEL=meta-llama/Llama-3.2-3B-Instruct DATASET_NAME=random RANDOM_INPUT_LEN=1000 RANDOM_OUTPUT_LEN=500 REQUEST_RATE=inf NUM_PROMPTS=900 IGNORE_EOS=true RESULT_FILENAME=results.json METADATA=deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500

Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate inf --num-prompts 900 --save-result --result-filename results.json --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:00:46 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:00:47 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=base', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 900/900 [00:23<00:00, 37.76it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  23.83
Total input tokens:                      899100
Total generated tokens:                  241786
Request throughput (req/s):              37.76
Output token throughput (tok/s):         10145.34
Total Token throughput (tok/s):          47871.58
---------------Time to First Token----------------
Mean TTFT (ms):                          1253.67
Median TTFT (ms):                        937.95
P99 TTFT (ms):                           3155.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.33
Median TPOT (ms):                        43.75
P99 TPOT (ms):                           97.13
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.97
Median ITL (ms):                         37.63
P99 ITL (ms):                            86.79

<<<RESULT_START>>> {"date": "20250530-200126", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 23.8322206619996, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 241786, "request_throughput": 37.76400079389359, "request_goodput:": null, "output_throughput": 10145.340773280393, "total_token_throughput": 47871.57756638009, "mean_ttft_ms": 1253.6749797966502, "median_ttft_ms": 937.9455774997041, "std_ttft_ms": 750.4385642809029, "p99_ttft_ms": 3155.812861540362, "mean_tpot_ms": 46.333835858714664, "median_tpot_ms": 43.749258578274386, "std_tpot_ms": 12.805014931694936, "p99_tpot_ms": 97.13205872092182, "mean_itl_ms": 40.968254144283954, "median_itl_ms": 37.628544499966665, "std_itl_ms": 11.948914401781659, "p99_itl_ms": 86.78916550024948} <<<RESULT_END>>> Appended results block for infinite QPS Cleaning up Job vllm-bench-job-inf... job.batch "vllm-bench-job-inf" deleted βœ… All benchmarks complete. Combined results in ./results.json Benchmark for base completed.

--- Uninstalling LLM Deployment: base ---
πŸš€ EXEC: ./llmd-installer.sh --minikube --uninstall
ℹ️  πŸ“‚ Setting up script environment...
ℹ️  kubectl can reach a running Kubernetes cluster.
ℹ️  πŸ—‘οΈ Tearing down GAIE Kubernetes infrastructure…
βœ… πŸ“œ Base CRDs: Deleting...
customresourcedefinition.apiextensions.k8s.io "gatewayclasses.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "gateways.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "grpcroutes.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "httproutes.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "referencegrants.gateway.networking.k8s.io" deleted
βœ… πŸšͺ GAIE CRDs: Deleting...
customresourcedefinition.apiextensions.k8s.io "inferencemodels.inference.networking.x-k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "inferencepools.inference.networking.x-k8s.io" deleted
βœ… πŸŽ’ Gateway provider 'kgateway': Deleting...
release "kgateway" uninstalled
release "kgateway-crds" uninstalled
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
ℹ️  πŸ—‘οΈ Uninstalling llm-d chart...
release "llm-d" uninstalled
ℹ️  πŸ—‘οΈ Deleting namespace llm-d...
namespace "llm-d" deleted
ℹ️  πŸ—‘οΈ Deleting monitoring namespace...
ℹ️  πŸ—‘οΈ Deleting Minikube hostPath PV (model-hostpath-pv)...
ℹ️  πŸ—‘οΈ Deleting ClusterRoleBinding llm-d
No resources found
βœ… πŸ’€ Uninstallation complete
Uninstallation for base completed.

Completed cycle for base.

Pausing for a few seconds before next deployment...

Processing Deployment 3/3: kvcache

--- Installing LLM Deployment: kvcache (using examples/kvcache/kvcache.yaml) ---
πŸš€ EXEC: ./llmd-installer.sh --minikube --values-file examples/kvcache/kvcache.yaml --disable-metrics-collection
ℹ️  πŸ“‚ Setting up script environment...
ℹ️  kubectl can reach a running Kubernetes cluster.
βœ… HF_TOKEN validated
ℹ️  πŸ—οΈ Installing GAIE Kubernetes infrastructure…
βœ… πŸ“œ Base CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/gatewayclasses.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/gateways.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/grpcroutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/httproutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/referencegrants.gateway.networking.k8s.io created
βœ… πŸšͺ GAIE CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/inferencemodels.inference.networking.x-k8s.io created
customresourcedefinition.apiextensions.k8s.io/inferencepools.inference.networking.x-k8s.io created
βœ… πŸŽ’ Gateway provider 'kgateway': Installing...
Release "kgateway-crds" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
NAME: kgateway-crds
LAST DEPLOYED: Fri May 30 20:01:46 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Release "kgateway" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway:v2.0.0
Digest: sha256:bbd7559eaa05ef6c27382390768889f5475e75bdcb4bd81ebd0f770cd14ab7a8
NAME: kgateway
LAST DEPLOYED: Fri May 30 20:01:48 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
βœ… GAIE infra applied
ℹ️  πŸ“¦ Creating namespace llm-d...
namespace/llm-d created
Context "minikube" modified.
βœ… Namespace ready
ℹ️  πŸ”Ή Using merged values: /tmp/tmp.xLX5pNIysj
ℹ️  πŸ” Creating/updating HF token secret...
secret/llm-d-hf-token created
βœ… HF token secret created
ℹ️  Fetching OCP proxy UID...
ℹ️  No OpenShift SCC annotation found; defaulting PROXY_UID=0
ℹ️  πŸ“œ Applying modelservice CRD...
customresourcedefinition.apiextensions.k8s.io/modelservices.llm-d.ai unchanged
βœ… ModelService CRD applied
ℹ️  ⏭️ Model download to PVC skipped: BYO model via HF repo_id selected.
protocol hf chosen - models will be downloaded JIT in inferencing pods.
"bitnami" already exists with the same configuration, skipping
ℹ️  πŸ› οΈ Building Helm chart dependencies...
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading common from repo https://charts.bitnami.com/bitnami
Downloading redis from repo https://charts.bitnami.com/bitnami
Pulled: registry-1.docker.io/bitnamicharts/redis:20.13.4
Digest: sha256:6a389e13237e8e639ec0d445e785aa246b57bfce711b087033a196a291d5c8d7
Deleting outdated charts
βœ… Dependencies built
ℹ️  Metrics collection disabled by user request
ℹ️  Metrics collection disabled by user request
ℹ️  Metrics collection disabled by user request
ℹ️  🚚 Deploying llm-d chart with /tmp/tmp.xLX5pNIysj...
Release "llm-d" does not exist. Installing it now.
NAME: llm-d
LAST DEPLOYED: Fri May 30 20:01:56 2025
NAMESPACE: llm-d
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing llm-d.

Your release is named `llm-d`.

To learn more about the release, try:

```bash
$ helm status llm-d
$ helm get all llm-d
```

Following presets are available to your users:

Name                                         Description
basic-gpu-preset                             Basic gpu inference
basic-gpu-with-nixl-preset                   GPU inference with NIXL P/D KV transfer and cache offloading
basic-gpu-with-nixl-and-redis-lookup-preset  GPU inference with NIXL P/D KV transfer, cache offloading and Redis lookup server
basic-sim-preset                             Basic simulation
βœ… llm-d deployed
βœ… πŸŽ‰ Installation complete.
Installation command for kvcache sent.

--- Waiting for vLLM instance for deployment 'kvcache' to initialize ---
ℹ️ This step can take some time, as it may involve downloading large model files and then initializing the vLLM engine.
Attempt 1/60: Waiting for 'kvcache' to be ready. Retrying in 30 seconds...
Attempt 2/60: Waiting for 'kvcache' to be ready. Retrying in 30 seconds...
Attempt 3/60: Waiting for 'kvcache' to be ready. Retrying in 30 seconds...
Attempt 4/60: Waiting for 'kvcache' to be ready. Retrying in 30 seconds...
Attempt 5/60: Waiting for 'kvcache' to be ready. Retrying in 30 seconds...
Attempt 6/60: Waiting for 'kvcache' to be ready. Retrying in 30 seconds...
βœ… vLLM instance for deployment 'kvcache' is ready!

--- Running Benchmark for: kvcache (Model: meta-llama/Llama-3.2-3B-Instruct, Prefill: 0, Decode: 4, Input: 1000, Output: 500) ---
Metadata: deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Result File: results.json
πŸš€ EXEC: ./run-bench.sh --model meta-llama/Llama-3.2-3B-Instruct --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --dataset-name random --input-len 1000 --output-len 500 --request-rates 10,30,inf --metadata deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500 --result-file results.json
secret/hf-token-secret created
▢️ Benchmarking MODEL=meta-llama/Llama-3.2-3B-Instruct at rates: 10 30 inf QPS for 30 seconds each
πŸ”– Results will go into ./results.json
πŸš€ Launching vllm-bench-job-10qps (QPS=10, prompts=300)…
job.batch/vllm-bench-job-10qps created
job.batch/vllm-bench-job-10qps condition met
πŸ“– Logs from vllm-bench-job-10qps:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 20:10:01 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=10
NUM_PROMPTS=300
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500

Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 10 --num-prompts 300 --save-result --result-filename results.json --metadata deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:10:11 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:10:12 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=300, logprobs=None, request_rate=10.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=kvcache', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 300/300 [00:35<00:00, 8.34it/s]
============ Serving Benchmark Result ============
Successful requests:                     300
Benchmark duration (s):                  35.98
Total input tokens:                      299700
Total generated tokens:                  81393
Request throughput (req/s):              8.34
Output token throughput (tok/s):         2262.19
Total Token throughput (tok/s):          10591.88
---------------Time to First Token----------------
Mean TTFT (ms):                          53.96
Median TTFT (ms):                        52.87
P99 TTFT (ms):                           87.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.56
Median TPOT (ms):                        13.46
P99 TPOT (ms):                           15.65
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.50
Median ITL (ms):                         12.90
P99 ITL (ms):                            34.21

<<<RESULT_START>>>
{"date": "20250530-201101", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "kvcache", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 35.97974775400053, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 81393, "request_throughput": 8.338023992028779, "request_goodput:": null, "output_throughput": 2262.189289277328, "total_token_throughput": 10591.87525731408, "mean_ttft_ms": 53.96254838997568, "median_ttft_ms": 52.86790049967749, "std_ttft_ms": 8.636840862111338, "p99_ttft_ms": 87.51494646001444, "mean_tpot_ms": 13.562497126015346, "median_tpot_ms": 13.460283402405013, "std_tpot_ms": 1.0343708502449698, "p99_tpot_ms": 15.645541584462748, "mean_itl_ms": 13.503936382745758, "median_itl_ms": 12.901992000479368, "std_itl_ms": 3.8045600847990464, "p99_itl_ms": 34.20889711982454}
<<<RESULT_END>>>
Appended results block for 10 QPS
Cleaning up Job vllm-bench-job-10qps...
job.batch "vllm-bench-job-10qps" deleted
πŸš€ Launching vllm-bench-job-30qps (QPS=30, prompts=900)…
job.batch/vllm-bench-job-30qps created
job.batch/vllm-bench-job-30qps condition met
πŸ“– Logs from vllm-bench-job-30qps:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 20:11:06 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=30
NUM_PROMPTS=900
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500

Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 30 --num-prompts 900 --save-result --result-filename results.json --metadata deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:11:15 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:11:16 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=30.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=kvcache', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 30.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 900/900 [00:40<00:00, 21.99it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  40.93
Total input tokens:                      899100
Total generated tokens:                  237626
Request throughput (req/s):              21.99
Output token throughput (tok/s):         5806.21
Total Token throughput (tok/s):          27775.05
---------------Time to First Token----------------
Mean TTFT (ms):                          72.16
Median TTFT (ms):                        73.32
P99 TTFT (ms):                           168.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.54
Median TPOT (ms):                        24.67
P99 TPOT (ms):                           32.86
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.37
Median ITL (ms):                         21.04
P99 ITL (ms):                            75.26

<<<RESULT_START>>>
{"date": "20250530-201214", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "kvcache", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.92615340700013, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 237626, "request_throughput": 21.99082799328171, "request_goodput:": null, "output_throughput": 5806.213880812844, "total_token_throughput": 27775.05104610127, "mean_ttft_ms": 72.16499725443909, "median_ttft_ms": 73.31621500043184, "std_ttft_ms": 32.75126651570117, "p99_ttft_ms": 168.3847288293964, "mean_tpot_ms": 23.541025315462797, "median_tpot_ms": 24.670025282634825, "std_tpot_ms": 5.525520582614772, "p99_tpot_ms": 32.858918094328644, "mean_itl_ms": 23.368492361219396, "median_itl_ms": 21.040741000433627, "std_itl_ms": 10.963675513599902, "p99_itl_ms": 75.25774875034585}
<<<RESULT_END>>>
Appended results block for 30 QPS
Cleaning up Job vllm-bench-job-30qps...
job.batch "vllm-bench-job-30qps" deleted
πŸš€ Launching vllm-bench-job-inf (infinite QPS, prompts=900)…
job.batch/vllm-bench-job-inf created
job.batch/vllm-bench-job-inf condition met
πŸ“– Logs from vllm-bench-job-inf:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 20:12:17 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=inf
NUM_PROMPTS=900
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500

Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate inf --num-prompts 900 --save-result --result-filename results.json --metadata deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:12:26 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:12:27 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=kvcache', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 900/900 [00:21<00:00, 41.22it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  21.84
Total input tokens:                      899100
Total generated tokens:                  240439
Request throughput (req/s):              41.22
Output token throughput (tok/s):         11011.35
Total Token throughput (tok/s):          52187.30
---------------Time to First Token----------------
Mean TTFT (ms):                          884.96
Median TTFT (ms):                        867.67
P99 TTFT (ms):                           1249.64
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.02
Median TPOT (ms):                        44.02
P99 TPOT (ms):                           71.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.44
Median ITL (ms):                         37.50
P99 ITL (ms):                            67.52

<<<RESULT_START>>>
{"date": "20250530-201304", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "kvcache", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 21.835562008000124, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240439, "request_throughput": 41.21716673334341, "request_goodput:": null, "output_throughput": 11011.349280220396, "total_token_throughput": 52187.29884683046, "mean_ttft_ms": 884.9588357166714, "median_ttft_ms": 867.6726769999732, "std_ttft_ms": 164.40779421335031, "p99_ttft_ms": 1249.6421772296253, "mean_tpot_ms": 45.0189959701786, "median_tpot_ms": 44.02160330742376, "std_tpot_ms": 8.692672848556585, "p99_tpot_ms": 71.62702592521168, "mean_itl_ms": 40.442648903097236, "median_itl_ms": 37.49666700059606, "std_itl_ms": 9.076602885769773, "p99_itl_ms": 67.51899562010284}
<<<RESULT_END>>>
Appended results block for infinite QPS
Cleaning up Job vllm-bench-job-inf...
job.batch "vllm-bench-job-inf" deleted
βœ… All benchmarks complete. Combined results in ./results.json
Benchmark for kvcache completed.
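One pattern worth noting across the three runs: the prompt counts track the 30-second target window (300 prompts at 10 QPS, 900 at 30 QPS, and 900 again for the unthrottled run). A sketch of that mapping, inferred from the logs rather than read out of run-bench.sh, with the 30-second window and 900-prompt cap as assumptions:

```python
def prompts_for_rate(rate: str, window_s: int = 30, cap: int = 900) -> int:
    """Mirror the observed 300/900/900 prompt counts for rates 10, 30, inf."""
    if rate == "inf":
        return cap  # unthrottled runs appear to use the cap directly
    return min(int(float(rate) * window_s), cap)

assert [prompts_for_rate(r) for r in ("10", "30", "inf")] == [300, 900, 900]
```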

--- Uninstalling LLM Deployment: kvcache ---
πŸš€ EXEC: ./llmd-installer.sh --minikube --uninstall
ℹ️ πŸ“‚ Setting up script environment...
ℹ️ kubectl can reach a running Kubernetes cluster.
ℹ️ πŸ—‘οΈ Tearing down GAIE Kubernetes infrastructure…
βœ… πŸ“œ Base CRDs: Deleting...
customresourcedefinition.apiextensions.k8s.io "gatewayclasses.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "gateways.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "grpcroutes.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "httproutes.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "referencegrants.gateway.networking.k8s.io" deleted
βœ… πŸšͺ GAIE CRDs: Deleting...
customresourcedefinition.apiextensions.k8s.io "inferencemodels.inference.networking.x-k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "inferencepools.inference.networking.x-k8s.io" deleted
βœ… πŸŽ’ Gateway provider 'kgateway': Deleting...
release "kgateway" uninstalled
release "kgateway-crds" uninstalled
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
ℹ️ πŸ—‘οΈ Uninstalling llm-d chart...
release "llm-d" uninstalled
ℹ️ πŸ—‘οΈ Deleting namespace llm-d...
namespace "llm-d" deleted
ℹ️ πŸ—‘οΈ Deleting monitoring namespace...
ℹ️ πŸ—‘οΈ Deleting Minikube hostPath PV (model-hostpath-pv)...
ℹ️ πŸ—‘οΈ Deleting ClusterRoleBinding llm-d
No resources found
βœ… πŸ’€ Uninstallation complete
Uninstallation for kvcache completed.

Completed cycle for kvcache.

πŸŽ‰ All configured deployments processed.
========= Full Deployment and Benchmark Process Finished =========


Total script execution time: 42m 46s (Total: 2566 seconds)
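Each benchmark appends one JSON object to the result file, so the combined file dumped below can be summarized with a few lines of Python. A minimal sketch, assuming one JSON object per line (adjust the parsing if your copy has records concatenated on a single line):

```python
import json

def load_results(path: str = "results.json"):
    """Parse newline-delimited benchmark records, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Print a compact comparison of throughput and TTFT across deployments/rates.
for r in load_results():
    print(f"{r['deployment']:>12} rate={r['request_rate']!s:>5} "
          f"req/s={r['request_throughput']:6.2f} "
          f"mean_ttft_ms={r['mean_ttft_ms']:8.1f} "
          f"p99_ttft_ms={r['p99_ttft_ms']:8.1f}")
```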

ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ cat results.json

{"date": "20250530-194738", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "no-features", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 35.84472489100017, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 80945, "request_throughput": 8.369432347779671, "request_goodput:": null, "output_throughput": 2258.212337970085, "total_token_throughput": 10619.275253401978, "mean_ttft_ms": 54.59561144333596, "median_ttft_ms": 52.868896000063614, "std_ttft_ms": 10.45418108071603, "p99_ttft_ms": 98.69780922984769, "mean_tpot_ms": 13.710971568239172, "median_tpot_ms": 13.686303201518745, "std_tpot_ms": 0.9966270621573491, "p99_tpot_ms": 15.793261194984588, "mean_itl_ms": 13.73187330695019, "median_itl_ms": 13.098381999952835, "std_itl_ms": 4.023339545674027, "p99_itl_ms": 34.48662216016601} {"date": "20250530-194848", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "no-features", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.722225800999695, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 237873, "request_throughput": 22.100953037245468, "request_goodput:": null, "output_throughput": 5841.355557587435, "total_token_throughput": 27920.20764179566, "mean_ttft_ms": 80.19281090889662, "median_ttft_ms": 76.65224249990388, "std_ttft_ms": 26.85547410843286, "p99_ttft_ms": 171.67865642984907, "mean_tpot_ms": 24.942996925599196, "median_tpot_ms": 25.466091678535058, "std_tpot_ms": 5.279276119470118, "p99_tpot_ms": 35.581390495641394, "mean_itl_ms": 24.814122015140974, "median_itl_ms": 21.825537999575317, "std_itl_ms": 11.660791168594844, "p99_itl_ms": 74.24358160013071} {"date": "20250530-194947", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "no-features", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 28.861716763000004, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240165, "request_throughput": 31.183176225808488, "request_goodput:": null, "output_throughput": 8321.230575856995, "total_token_throughput": 39473.22362543967, "mean_ttft_ms": 4386.739448253334, "median_ttft_ms": 4066.1594384998807, "std_ttft_ms": 2469.091218460141, "p99_ttft_ms": 9951.535929819987, "mean_tpot_ms": 61.81461881712581, "median_tpot_ms": 53.13592476507745, "std_tpot_ms": 22.250639166144722, "p99_tpot_ms": 118.56057960224177, "mean_itl_ms": 48.16794500382424, "median_itl_ms": 37.264703000346344, "std_itl_ms": 26.871669857570474, "p99_itl_ms": 134.96996935997225} {"date": "20250530-195922", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 
300, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 36.19546746500009, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 82608, "request_throughput": 8.288330584211705, "request_goodput:": null, "output_throughput": 2282.2747096685353, "total_token_throughput": 10562.316963296029, "mean_ttft_ms": 54.592403276680365, "median_ttft_ms": 52.79431650001243, "std_ttft_ms": 9.218280749895815, "p99_ttft_ms": 89.80040712966002, "mean_tpot_ms": 13.81029661989983, "median_tpot_ms": 13.619587584448205, "std_tpot_ms": 1.1567995163110574, "p99_tpot_ms": 17.1377381639993, "mean_itl_ms": 13.787254072508166, "median_itl_ms": 13.049414999841247, "std_itl_ms": 4.111252531976249, "p99_itl_ms": 34.86666600036186} {"date": "20250530-200032", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.77285659999961, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240128, "request_throughput": 22.073508580215805, "request_goodput:": null, "output_throughput": 5889.408298166733, "total_token_throughput": 27940.843369802322, "mean_ttft_ms": 73.27519322778042, "median_ttft_ms": 72.58796349969998, "std_ttft_ms": 34.88893424361722, "p99_ttft_ms": 182.99465706019868, "mean_tpot_ms": 23.98439489090563, "median_tpot_ms": 24.200584621242403, "std_tpot_ms": 6.2426528186199635, "p99_tpot_ms": 35.31807143475544, "mean_itl_ms": 23.91148948750557, "median_itl_ms": 21.303171499766904, "std_itl_ms": 11.541504488505524, "p99_itl_ms": 75.27688499983014} {"date": "20250530-200126", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 23.8322206619996, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 241786, "request_throughput": 37.76400079389359, "request_goodput:": null, "output_throughput": 10145.340773280393, "total_token_throughput": 47871.57756638009, "mean_ttft_ms": 1253.6749797966502, "median_ttft_ms": 937.9455774997041, "std_ttft_ms": 750.4385642809029, "p99_ttft_ms": 3155.812861540362, "mean_tpot_ms": 46.333835858714664, "median_tpot_ms": 43.749258578274386, "std_tpot_ms": 12.805014931694936, "p99_tpot_ms": 97.13205872092182, "mean_itl_ms": 40.968254144283954, "median_itl_ms": 37.628544499966665, "std_itl_ms": 11.948914401781659, "p99_itl_ms": 86.78916550024948} {"date": "20250530-201101", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "kvcache", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": 
"1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 35.97974775400053, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 81393, "request_throughput": 8.338023992028779, "request_goodput:": null, "output_throughput": 2262.189289277328, "total_token_throughput": 10591.87525731408, "mean_ttft_ms": 53.96254838997568, "median_ttft_ms": 52.86790049967749, "std_ttft_ms": 8.636840862111338, "p99_ttft_ms": 87.51494646001444, "mean_tpot_ms": 13.562497126015346, "median_tpot_ms": 13.460283402405013, "std_tpot_ms": 1.0343708502449698, "p99_tpot_ms": 15.645541584462748, "mean_itl_ms": 13.503936382745758, "median_itl_ms": 12.901992000479368, "std_itl_ms": 3.8045600847990464, "p99_itl_ms": 34.20889711982454} {"date": "20250530-201214", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "kvcache", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.92615340700013, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 237626, "request_throughput": 21.99082799328171, "request_goodput:": null, "output_throughput": 5806.213880812844, "total_token_throughput": 27775.05104610127, "mean_ttft_ms": 72.16499725443909, "median_ttft_ms": 73.31621500043184, "std_ttft_ms": 32.75126651570117, "p99_ttft_ms": 168.3847288293964, "mean_tpot_ms": 23.541025315462797, "median_tpot_ms": 24.670025282634825, "std_tpot_ms": 5.525520582614772, "p99_tpot_ms": 32.858918094328644, "mean_itl_ms": 23.368492361219396, "median_itl_ms": 21.040741000433627, "std_itl_ms": 10.963675513599902, "p99_itl_ms": 75.25774875034585} {"date": "20250530-201304", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "kvcache", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 21.835562008000124, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240439, "request_throughput": 41.21716673334341, "request_goodput:": null, "output_throughput": 11011.349280220396, "total_token_throughput": 52187.29884683046, "mean_ttft_ms": 884.9588357166714, "median_ttft_ms": 867.6726769999732, "std_ttft_ms": 164.40779421335031, "p99_ttft_ms": 1249.6421772296253, "mean_tpot_ms": 45.0189959701786, "median_tpot_ms": 44.02160330742376, "std_tpot_ms": 8.692672848556585, "p99_tpot_ms": 71.62702592521168, "mean_itl_ms": 40.442648903097236, "median_itl_ms": 37.49666700059606, "std_itl_ms": 9.076602885769773, "p99_itl_ms": 67.51899562010284} ubuntu@ip-172-31-16-33:/llm-d-deployer/quickstart$ ubuntu@ip-172-31-16-33:/llm-d-deployer/quickstart$ echo > results.json ubuntu@ip-172-31-16-33:/llm-d-deployer/quickstart$ ubuntu@ip-172-31-16-33:/llm-d-deployer/quickstart$ ls README-minikube.md e2e-bench-control.sh grafana infra istio-test-request.sh metrics-overview.md results.json.all run-bench.sh README.md examples grafana-setup.md install-deps.sh llmd-installer.sh results.json results.json.bak test-request.sh ubuntu@ip-172-31-16-33:/llm-d-deployer/quickstart$ 
./run-bench.sh --model meta-llama/Llama-3.2-3B-Instruct
--base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
--dataset-name random
--input-len 1000
--output-len 500
--request-rates 10,30,inf
--metadata "deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500"
--result-file results.json namespace/llm-d created secret/hf-token-secret created ▢️ Benchmarking MODEL=meta-llama/Llama-3.2-3B-Instruct at rates: 10 30 inf QPS for 30 seconds each πŸ”– Results will go into ./results.json πŸš€ Launching vllm-bench-job-10qps (QPS=10, prompts=300)… job.batch/vllm-bench-job-10qps created ^Cubuntu@ip-172-31-16-33:
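The quoted --metadata string is split into key=value pairs and handed through to benchmark_serving.py, which is how deployment, gpu, and the replica counts end up as top-level string fields in each result record. A sketch of that merge, assuming the simple split-on-'=' behavior the Namespace dumps in the logs suggest:

```python
def merge_metadata(record: dict, pairs: list[str]) -> dict:
    """Fold 'key=value' strings into a result record as string fields,
    mirroring how deployment/gpu/... appear in results.json."""
    for pair in pairs:
        key, _, value = pair.partition("=")
        record[key] = value
    return record

rec = merge_metadata({"request_rate": 10.0},
                     ["deployment=base", "gpu=4xNVIDIA_L40S", "decode_replicas=4"])
assert rec["gpu"] == "4xNVIDIA_L40S"
```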
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ^C
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./llmd-installer.sh --uninstall
ℹ️ πŸ“‚ Setting up script environment...
ℹ️ kubectl can reach a running Kubernetes cluster.
ℹ️ πŸ—‘οΈ Tearing down GAIE Kubernetes infrastructure…
βœ… πŸ“œ Base CRDs: Deleting...
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gateway-api": customresourcedefinitions.apiextensions.k8s.io "gatewayclasses.gateway.networking.k8s.io" not found
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gateway-api": customresourcedefinitions.apiextensions.k8s.io "gateways.gateway.networking.k8s.io" not found
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gateway-api": customresourcedefinitions.apiextensions.k8s.io "grpcroutes.gateway.networking.k8s.io" not found
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gateway-api": customresourcedefinitions.apiextensions.k8s.io "httproutes.gateway.networking.k8s.io" not found
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gateway-api": customresourcedefinitions.apiextensions.k8s.io "referencegrants.gateway.networking.k8s.io" not found
βœ… πŸšͺ GAIE CRDs: Deleting...
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gie": customresourcedefinitions.apiextensions.k8s.io "inferencemodels.inference.networking.x-k8s.io" not found
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gie": customresourcedefinitions.apiextensions.k8s.io "inferencepools.inference.networking.x-k8s.io" not found
βœ… πŸŽ’ Gateway provider 'kgateway': Deleting...
release "kgateway" uninstalled
release "kgateway-crds" uninstalled
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
ℹ️ πŸ—‘οΈ Uninstalling llm-d chart...
release "llm-d" uninstalled
ℹ️ πŸ—‘οΈ Deleting namespace llm-d...
namespace "llm-d" deleted
ℹ️ πŸ—‘οΈ Deleting monitoring namespace...
ℹ️ πŸ—‘οΈ Deleting ClusterRoleBinding llm-d
No resources found
βœ… πŸ’€ Uninstallation complete
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ls examples/
all-features/  base/  kvcache/  llama4-fp8.yaml  no-features/  pd-nixl/
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ls examples/base/
base.yaml  slim/
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ls examples/base/base.yaml ^C
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ls examples/kvcache/kvcache.yaml ^C
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./llmd-installer.sh --values-file examples/kvcache/kvcache.yaml --minikube
ℹ️ πŸ“‚ Setting up script environment...
ℹ️ kubectl can reach a running Kubernetes cluster.
βœ… HF_TOKEN validated
ℹ️ πŸ—οΈ Installing GAIE Kubernetes infrastructure…
βœ… πŸ“œ Base CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/gatewayclasses.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/gateways.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/grpcroutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/httproutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/referencegrants.gateway.networking.k8s.io created
βœ… πŸšͺ GAIE CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/inferencemodels.inference.networking.x-k8s.io created
customresourcedefinition.apiextensions.k8s.io/inferencepools.inference.networking.x-k8s.io created
βœ… πŸŽ’ Gateway provider 'kgateway': Installing...
Release "kgateway-crds" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
NAME: kgateway-crds
LAST DEPLOYED: Fri May 30 20:27:50 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Release "kgateway" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway:v2.0.0
Digest: sha256:bbd7559eaa05ef6c27382390768889f5475e75bdcb4bd81ebd0f770cd14ab7a8
NAME: kgateway
LAST DEPLOYED: Fri May 30 20:27:51 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
βœ… GAIE infra applied
ℹ️ πŸ“¦ Creating namespace llm-d...
namespace/llm-d created
Context "minikube" modified.
βœ… Namespace ready
ℹ️ πŸ”Ή Using merged values: /tmp/tmp.s8hh00P0yh
ℹ️ πŸ” Creating/updating HF token secret...
secret/llm-d-hf-token created
βœ… HF token secret created
ℹ️ Fetching OCP proxy UID...
ℹ️ No OpenShift SCC annotation found; defaulting PROXY_UID=0
ℹ️ πŸ“œ Applying modelservice CRD...
customresourcedefinition.apiextensions.k8s.io/modelservices.llm-d.ai unchanged
βœ… ModelService CRD applied
ℹ️ ⏭️ Model download to PVC skipped: BYO model via HF repo_id selected. protocol hf chosen - models will be downloaded JIT in inferencing pods.
"bitnami" already exists with the same configuration, skipping
ℹ️ πŸ› οΈ Building Helm chart dependencies...
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading common from repo https://charts.bitnami.com/bitnami
Downloading redis from repo https://charts.bitnami.com/bitnami
Pulled: registry-1.docker.io/bitnamicharts/redis:20.13.4
Digest: sha256:6a389e13237e8e639ec0d445e785aa246b57bfce711b087033a196a291d5c8d7
Deleting outdated charts
βœ… Dependencies built
ℹ️ πŸ” Checking for ServiceMonitor CRD (monitoring.coreos.com)...
ℹ️ ⚠️ ServiceMonitor CRD (monitoring.coreos.com) not found
ℹ️ ⚠️ ServiceMonitor CRD (monitoring.coreos.com) not found
ℹ️ πŸ” Checking for ServiceMonitor CRD (monitoring.coreos.com)...
ℹ️ ⚠️ ServiceMonitor CRD (monitoring.coreos.com) not found
ℹ️ 🌱 Provisioning Prometheus operator…
ℹ️ πŸ“¦ Creating monitoring namespace...
namespace/llm-d-monitoring created
ℹ️ πŸš€ Installing Prometheus stack...
ℹ️ ⏳ Waiting for Prometheus stack pods to be ready...
pod/prometheus-prometheus-kube-prometheus-prometheus-0 condition met
pod/prometheus-grafana-6ffc4fbfb7-9qwpf condition met
βœ… πŸš€ Prometheus and Grafana installed.
ℹ️ 🌱 Minikube detected; provisioning Prometheus/Grafana…
ℹ️ 🌱 Provisioning Prometheus operator…
ℹ️ πŸ“¦ Monitoring namespace already exists
ℹ️ ⚠️ Prometheus stack already installed in llm-d-monitoring namespace
ℹ️ Metrics collection enabled
ℹ️ 🚚 Deploying llm-d chart with /tmp/tmp.s8hh00P0yh...
Release "llm-d" does not exist. Installing it now.
NAME: llm-d
LAST DEPLOYED: Fri May 30 20:29:15 2025
NAMESPACE: llm-d
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES: Thank you for installing llm-d.

Your release is named llm-d.

To learn more about the release, try:

$ helm status llm-d
$ helm get all llm-d

The following presets are available to your users:

Name                                          Description
basic-gpu-preset                              Basic gpu inference
basic-gpu-with-nixl-preset                    GPU inference with NIXL P/D KV transfer and cache offloading
basic-gpu-with-nixl-and-redis-lookup-preset   GPU inference with NIXL P/D KV transfer, cache offloading and Redis lookup server
basic-sim-preset                              Basic simulation
βœ… llm-d deployed
βœ… πŸŽ‰ Installation complete.
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./^Cn-bench.sh --model meta-llama/Llama-3.2-3B-Instruct --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --dataset-name random --input-len 1000 --output-len 500 --request-rates 10,30,inf --metadata "deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500" --result-file results.json
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ kubectl get pods # --all-namespaces
NAME                                                       READY   STATUS    RESTARTS   AGE
llm-d-inference-gateway-5fbd8c566-htfz5                    1/1     Running   0          49s
llm-d-modelservice-5757d7b578-zgr4g                        1/1     Running   0          50s
meta-llama-llama-3-2-3b-instruct-decode-7bf457bdcc-24dc9   2/2     Running   0          48s
meta-llama-llama-3-2-3b-instruct-decode-7bf457bdcc-dxpb4   2/2     Running   0          48s
meta-llama-llama-3-2-3b-instruct-decode-7bf457bdcc-qm4gs   2/2     Running   0          48s
meta-llama-llama-3-2-3b-instruct-decode-7bf457bdcc-xpp6n   2/2     Running   0          48s
meta-llama-llama-3-2-3b-instruct-epp-555969c945-j8h74      1/1     Running   0          48s
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./test-request.sh --minikube
Namespace: llm-d
Model ID: none; will be discovered from first entry in /v1/models

Minikube validation: hitting gateway DNS at llm-d-inference-gateway.llm-d.svc.cluster.local:80
1 -> GET /v1/models via DNS at llm-d-inference-gateway.llm-d.svc.cluster.local:80… pod "curl-2965" deleted
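The validation above amounts to an HTTP probe of the gateway from inside the cluster. A rough equivalent from the host, assuming a local port-forward (`kubectl port-forward -n llm-d svc/llm-d-inference-gateway 8000:80`); the completion call in step 2 is a hypothetical follow-up, since only the /v1/models probe is visible in the log:

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # assumes an active kubectl port-forward to the gateway

# 1) discover the served model, as test-request.sh does with GET /v1/models
models = json.load(urllib.request.urlopen(f"{BASE}/v1/models"))
model_id = models["data"][0]["id"]

# 2) hypothetical follow-up: a small completion against the discovered model
req = urllib.request.Request(
    f"{BASE}/v1/completions",
    data=json.dumps({"model": model_id, "prompt": "Hello", "max_tokens": 8}).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.load(urllib.request.urlopen(req)))
```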

ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./test-request.sh --minikube
Namespace: llm-d
Model ID: none; will be discovered from first entry in /v1/models

Minikube validation: hitting gateway DNS at llm-d-inference-gateway.llm-d.svc.cluster.local:80
1 -> GET /v1/models via DNS at llm-d-inference-gateway.llm-d.svc.cluster.local:80… error: timed out waiting for the condition
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./^Cmd-installer.sh --values-file examples/kvcache/kvcache.yaml --minikube
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./run-bench.sh --model meta-llama/Llama-3.2-3B-Instruct \
  --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 \
  --dataset-name random \
  --input-len 1000 \
  --output-len 500 \
  --request-rates 10,30,inf \
  --metadata "deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500" \
  --result-file results.json
secret/hf-token-secret created
▢️ Benchmarking MODEL=meta-llama/Llama-3.2-3B-Instruct at rates: 10 30 inf QPS for 30 seconds each
πŸ”– Results will go into ./results.json
πŸš€ Launching vllm-bench-job-10qps (QPS=10, prompts=300)…
job.batch/vllm-bench-job-10qps created
job.batch/vllm-bench-job-10qps condition met
πŸ“– Logs from vllm-bench-job-10qps:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 20:37:06 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=10
NUM_PROMPTS=300
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500

Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 10 --num-prompts 300 --save-result --result-filename results.json --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:37:15 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:37:17 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=300, logprobs=None, request_rate=10.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=base', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 300/300 [00:36<00:00, 8.33it/s]
============ Serving Benchmark Result ============
Successful requests:                     300
Benchmark duration (s):                  36.03
Total input tokens:                      299700
Total generated tokens:                  81050
Request throughput (req/s):              8.33
Output token throughput (tok/s):         2249.25
Total Token throughput (tok/s):          10566.33
---------------Time to First Token----------------
Mean TTFT (ms):                          54.58
Median TTFT (ms):                        53.00
P99 TTFT (ms):                           89.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.58
Median TPOT (ms):                        13.51
P99 TPOT (ms):                           17.65
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.51
Median ITL (ms):                         12.84
P99 ITL (ms):                            34.21

<<<RESULT_START>>>
{"date": "20250530-203806", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 36.034275877000255, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 81050, "request_throughput": 8.325406649602808, "request_goodput:": null, "output_throughput": 2249.247363167692, "total_token_throughput": 10566.328606120898, "mean_ttft_ms": 54.577085783294024, "median_ttft_ms": 52.999333499883505, "std_ttft_ms": 9.27197373989555, "p99_ttft_ms": 89.76647855975897, "mean_tpot_ms": 13.577995860227945, "median_tpot_ms": 13.512526767533997, "std_tpot_ms": 0.9199628548912276, "p99_tpot_ms": 17.652194109258673, "mean_itl_ms": 13.505887271504633, "median_itl_ms": 12.841573499827064, "std_itl_ms": 3.839438281316489, "p99_itl_ms": 34.210415409961556}
<<<RESULT_END>>>
Appended results block for 10 QPS
Cleaning up Job vllm-bench-job-10qps...
job.batch "vllm-bench-job-10qps" deleted
πŸš€ Launching vllm-bench-job-30qps (QPS=30, prompts=900)…
job.batch/vllm-bench-job-30qps created
job.batch/vllm-bench-job-30qps condition met
πŸ“– Logs from vllm-bench-job-30qps:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 20:38:10 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=30
NUM_PROMPTS=900
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500

Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 30 --num-prompts 900 --save-result --result-filename results.json --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:38:19 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:38:20 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=30.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=base', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 30.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 900/900 [00:40<00:00, 22.18it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  40.57
Total input tokens:                      899100
Total generated tokens:                  239959
Request throughput (req/s):              22.18
Output token throughput (tok/s):         5914.31
Total Token throughput (tok/s):          28074.59
---------------Time to First Token----------------
Mean TTFT (ms):                          73.23
Median TTFT (ms):                        74.51
P99 TTFT (ms):                           169.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.59
Median TPOT (ms):                        24.65
P99 TPOT (ms):                           32.72
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.47
Median ITL (ms):                         20.99
P99 ITL (ms):                            75.82

<<<RESULT_START>>>
{"date": "20250530-203916", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.57259980900017, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 239959, "request_throughput": 22.182458216551215, "request_goodput:": null, "output_throughput": 5914.311656872681, "total_token_throughput": 28074.587415207345, "mean_ttft_ms": 73.22729984779572, "median_ttft_ms": 74.51430299988715, "std_ttft_ms": 33.92979932231269, "p99_ttft_ms": 169.39216242027214, "mean_tpot_ms": 23.586269942687313, "median_tpot_ms": 24.64663913131863, "std_tpot_ms": 5.4388490325372665, "p99_tpot_ms": 32.724085161084425, "mean_itl_ms": 23.466772128068868, "median_itl_ms": 20.98881700021593, "std_itl_ms": 11.048866594118087, "p99_itl_ms": 75.82270559991227}
<<<RESULT_END>>>
Appended results block for 30 QPS
Cleaning up Job vllm-bench-job-30qps...
job.batch "vllm-bench-job-30qps" deleted
πŸš€ Launching vllm-bench-job-inf (infinite QPS, prompts=900)…
job.batch/vllm-bench-job-inf created
job.batch/vllm-bench-job-inf condition met
πŸ“– Logs from vllm-bench-job-inf:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 20:39:20 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=inf
NUM_PROMPTS=900
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500

Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate inf --num-prompts 900 --save-result --result-filename results.json --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:39:29 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:39:30 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:39:30 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:39:30 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:39:30 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:39:30 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:39:30 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:39:30 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:39:31 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=base', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 900/900 [00:20<00:00, 43.02it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  20.92
Total input tokens:                      899100
Total generated tokens:                  240163
Request throughput (req/s):              43.02
Output token throughput (tok/s):         11481.02
Total Token throughput (tok/s):          54462.58
---------------Time to First Token----------------
Mean TTFT (ms):                          902.25
Median TTFT (ms):                        877.08
P99 TTFT (ms):                           1353.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.36
Median TPOT (ms):                        43.74
P99 TPOT (ms):                           67.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.13
Median ITL (ms):                         36.94
P99 ITL (ms):                            66.37

<<<RESULT_START>>>
{"date": "20250530-204006", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 20.918269691000205, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240163, "request_throughput": 43.024591101204344, "request_goodput:": null, "output_throughput": 11481.016525153933, "total_token_throughput": 54462.58303525708, "mean_ttft_ms": 902.2531994799939, "median_ttft_ms": 877.0819514993491, "std_ttft_ms": 194.26376120632074, "p99_ttft_ms": 1353.1694669699937, "mean_tpot_ms": 44.35695594492121, "median_tpot_ms": 43.742781012684844, "std_tpot_ms": 7.88600060931539, "p99_tpot_ms": 67.41057380823649, "mean_itl_ms": 40.126101133447385, "median_itl_ms": 36.940078000043286, "std_itl_ms": 8.77480669076588, "p99_itl_ms": 66.37176537940839}
<<<RESULT_END>>>
Appended results block for infinite QPS
Cleaning up Job vllm-bench-job-inf...
job.batch "vllm-bench-job-inf" deleted
βœ… All benchmarks complete. Combined results in ./results.json
