$ ENV_METADATA_GPU="4xNVIDIA_L40S" \
./e2e-bench-control.sh --4xgpu-minikube --model meta-llama/Llama-3.2-3B-Instruct
LLM Deployment and Benchmark Orchestrator
-------------------------------------------------
--- Configuration Summary ---
Minikube Start Args (Hardcoded): --driver docker --container-runtime docker --gpus all --memory no-limit --cpus no-limit
LLMD Installer Script (Hardcoded): ./llmd-installer.sh
Test Request Script (Hardcoded): ./test-request.sh (Args: --minikube, Retry: 30s)
Run Bench Script (Hardcoded): ./run-bench.sh
Benchmark model for metadata will be derived from the --model flag or dynamically from each values file.
Benchmark Metadata GPU (Configurable, consistent for all runs): 4xNVIDIA_L40S
Benchmark Metadata Gateway (Configurable): kgateway
Result File (Configurable): results.json
YAML Modifications for --4xgpu-minikube: ENABLED
YAML Model Override (from --model flag): ENABLED (Model: meta-llama/Llama-3.2-3B-Instruct)
Deployments to process: 3
- Values: examples/no-features/no-features.yaml
- Values: examples/base/base.yaml
- Values: examples/kvcache/kvcache.yaml
-----------------------------
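The GPU label in the metadata is supplied through the environment, as the invocation at the top of this transcript shows. For example, the same sweep could be tagged for a run on different hardware (only `ENV_METADATA_GPU` is confirmed by the invocation above; other configurable fields may be wired differently):

```bash
# Re-run the sweep with a different GPU metadata tag; ENV_METADATA_GPU is the
# variable shown in the invocation at the top of this transcript.
ENV_METADATA_GPU="8xNVIDIA_H100" \
  ./e2e-bench-control.sh --4xgpu-minikube --model meta-llama/Llama-3.2-3B-Instruct
```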
🛠️ Applying --4xgpu-minikube modifications to YAML files...
Processing examples/no-features/no-features.yaml for potential --4xgpu-minikube modifications...
Using default replica counts for examples/no-features/no-features.yaml.
Successfully updated replicas and nodeSelectors in examples/no-features/no-features.yaml.
Processing examples/base/base.yaml for potential --4xgpu-minikube modifications...
Using default replica counts for examples/base/base.yaml.
Successfully updated replicas and nodeSelectors in examples/base/base.yaml.
Processing examples/kvcache/kvcache.yaml for potential --4xgpu-minikube modifications...
Using default replica counts for examples/kvcache/kvcache.yaml.
Successfully updated replicas and nodeSelectors in examples/kvcache/kvcache.yaml.
🛠️ --4xgpu-minikube modifications complete.
-------------------------------------
🛠️ Applying --model 'meta-llama/Llama-3.2-3B-Instruct' modifications to YAML files...
Processing examples/no-features/no-features.yaml for model override...
Successfully updated model in examples/no-features/no-features.yaml.
Processing examples/base/base.yaml for model override...
Successfully updated model in examples/base/base.yaml.
Processing examples/kvcache/kvcache.yaml for model override...
Successfully updated model in examples/kvcache/kvcache.yaml.
🛠️ Model override modifications complete.
-------------------------------------
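The exact in-place edits live inside e2e-bench-control.sh. A rough yq equivalent of the two passes above might look like the following sketch; the YAML paths here are hypothetical, purely for illustration, and the real paths are defined by each values file:

```bash
# Hypothetical sketch of the two modification passes, assuming yq v4 and
# illustrative (not verified) YAML paths in the values files.
for f in examples/no-features/no-features.yaml examples/base/base.yaml examples/kvcache/kvcache.yaml; do
  yq -i '.decode.replicas = 4' "$f"                               # default replica counts + nodeSelectors pass
  yq -i '.model.name = "meta-llama/Llama-3.2-3B-Instruct"' "$f"   # --model override pass
done
```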
========= Starting Full Deployment and Benchmark Process =========
--- Minikube Setup ---
Deleting any existing Minikube instance...
Starting Minikube with: --driver docker --container-runtime docker --gpus all --memory no-limit --cpus no-limit
🚀 EXEC: minikube start --driver docker --container-runtime docker --gpus all --memory no-limit --cpus no-limit
😄 minikube v1.35.0 on Ubuntu 24.04
✨ Using the docker driver based on user configuration
📌 Using Docker driver with root privileges
👍 Starting "minikube" primary control-plane node in "minikube" cluster
🚜 Pulling base image v0.0.46 ...
🔥 Creating docker container (CPUs=no-limit, Memory=no-limit) ...
🐳 Preparing Kubernetes v1.32.0 on Docker 27.4.1 ...
    ▪ Generating certificates and keys ...
    ▪ Booting up control plane ...
    ▪ Configuring RBAC rules ...
🔗 Configuring bridge CNI (Container Networking Interface) ...
🔎 Verifying Kubernetes components...
    ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
    ▪ Using image nvcr.io/nvidia/k8s-device-plugin:v0.17.0
🌟 Enabled addons: nvidia-device-plugin, default-storageclass
🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
Minikube started. Waiting for 10 seconds for stabilization...
Minikube setup complete.
-------------------------
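Before the first deployment starts, it can be worth confirming that the nvidia-device-plugin addon actually advertised the GPUs to the cluster. This probe is not part of the orchestrator, just a quick manual check:

```bash
# Should print the number of allocatable GPUs on the node (4 on this machine).
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```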
Processing Deployment 1/3: no-features
=================================================================
--- Installing LLM Deployment: no-features (using examples/no-features/no-features.yaml) ---
🚀 EXEC: ./llmd-installer.sh --minikube --values-file examples/no-features/no-features.yaml --disable-metrics-collection
ℹ️ Setting up script environment...
ℹ️ kubectl can reach a running Kubernetes cluster.
✅ HF_TOKEN validated
ℹ️ 🏗️ Installing GAIE Kubernetes infrastructure…
✅ Base CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/gatewayclasses.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/gateways.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/grpcroutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/httproutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/referencegrants.gateway.networking.k8s.io created
✅ GAIE CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/inferencemodels.inference.networking.x-k8s.io created
customresourcedefinition.apiextensions.k8s.io/inferencepools.inference.networking.x-k8s.io created
✅ Gateway provider 'kgateway': Installing...
Release "kgateway-crds" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
NAME: kgateway-crds
LAST DEPLOYED: Fri May 30 19:32:41 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Release "kgateway" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway:v2.0.0
Digest: sha256:bbd7559eaa05ef6c27382390768889f5475e75bdcb4bd81ebd0f770cd14ab7a8
NAME: kgateway
LAST DEPLOYED: Fri May 30 19:32:42 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
✅ GAIE infra applied
ℹ️ 📦 Creating namespace llm-d...
namespace/llm-d created
Context "minikube" modified.
✅ Namespace ready
ℹ️ Using merged values: /tmp/tmp.BJddhCr3Ka
ℹ️ Creating/updating HF token secret...
secret/llm-d-hf-token created
✅ HF token secret created
ℹ️ Fetching OCP proxy UID...
ℹ️ No OpenShift SCC annotation found; defaulting PROXY_UID=0
ℹ️ Applying modelservice CRD...
customresourcedefinition.apiextensions.k8s.io/modelservices.llm-d.ai created
✅ ModelService CRD applied
ℹ️ Model download to PVC skipped: BYO model via HF repo_id selected.
protocol hf chosen - models will be downloaded JIT in inferencing pods.
"bitnami" already exists with the same configuration, skipping
ℹ️ 🛠️ Building Helm chart dependencies...
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading common from repo https://charts.bitnami.com/bitnami
Downloading redis from repo https://charts.bitnami.com/bitnami
Pulled: registry-1.docker.io/bitnamicharts/redis:20.13.4
Digest: sha256:6a389e13237e8e639ec0d445e785aa246b57bfce711b087033a196a291d5c8d7
Deleting outdated charts
✅ Dependencies built
ℹ️ Metrics collection disabled by user request
ℹ️ Metrics collection disabled by user request
ℹ️ Metrics collection disabled by user request
ℹ️ 🚀 Deploying llm-d chart with /tmp/tmp.BJddhCr3Ka...
Release "llm-d" does not exist. Installing it now.
NAME: llm-d
LAST DEPLOYED: Fri May 30 19:32:50 2025
NAMESPACE: llm-d
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing llm-d.
Your release is named `llm-d`.
To learn more about the release, try:
```bash
$ helm status llm-d
$ helm get all llm-d
```

Following presets are available to your users:

| Name | Description |
|---|---|
| basic-gpu-preset | Basic gpu inference |
| basic-gpu-with-nixl-preset | GPU inference with NIXL P/D KV transfer and cache offloading |
| basic-gpu-with-nixl-and-redis-lookup-preset | GPU inference with NIXL P/D KV transfer, cache offloading and Redis lookup server |
| basic-sim-preset | Basic simulation |

✅ llm-d deployed
✅ 🎉 Installation complete.
Installation command for no-features sent.
--- Waiting for vLLM instance for deployment 'no-features' to initialize ---
ℹ️ This step can take some time, as it may involve downloading large model files and then initializing the vLLM engine.
Attempt 1/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 2/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 3/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 4/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 5/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 6/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 7/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 8/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
Attempt 9/60: Waiting for 'no-features' to be ready. Retrying in 30 seconds...
✅ vLLM instance for deployment 'no-features' is ready!
--- Running Benchmark for: no-features (Model: meta-llama/Llama-3.2-3B-Instruct, Prefill: 0, Decode: 4, Input: 1000, Output: 500) ---
Metadata: deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Result File: results.json
🚀 EXEC: ./run-bench.sh --model meta-llama/Llama-3.2-3B-Instruct --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --dataset-name random --input-len 1000 --output-len 500 --request-rates 10,30,inf --metadata deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500 --result-file results.json
secret/hf-token-secret created
▶️ Benchmarking MODEL=meta-llama/Llama-3.2-3B-Instruct at rates: 10 30 inf QPS for 30 seconds each
Results will go into ./results.json
🚀 Launching vllm-bench-job-10qps (QPS=10, prompts=300)…
job.batch/vllm-bench-job-10qps created
job.batch/vllm-bench-job-10qps condition met
Logs from vllm-bench-job-10qps:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 19:46:38 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=10
NUM_PROMPTS=300
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 10 --num-prompts 300 --save-result --result-filename results.json --metadata deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 19:46:47 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 19:46:47 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 19:46:48 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=300, logprobs=None, request_rate=10.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=no-features', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 300/300 [00:35<00:00, 8.37it/s]
============ Serving Benchmark Result ============
Successful requests:                     300
Benchmark duration (s):                  35.84
Total input tokens:                      299700
Total generated tokens:                  80945
Request throughput (req/s):              8.37
Output token throughput (tok/s):         2258.21
Total Token throughput (tok/s):          10619.28
---------------Time to First Token----------------
Mean TTFT (ms):                          54.60
Median TTFT (ms):                        52.87
P99 TTFT (ms):                           98.70
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.71
Median TPOT (ms):                        13.69
P99 TPOT (ms):                           15.79
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.73
Median ITL (ms):                         13.10
P99 ITL (ms):                            34.49
<<<RESULT_START>>> {"date": "20250530-194738", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "no-features", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 35.84472489100017, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 80945, "request_throughput": 8.369432347779671, "request_goodput:": null, "output_throughput": 2258.212337970085, "total_token_throughput": 10619.275253401978, "mean_ttft_ms": 54.59561144333596, "median_ttft_ms": 52.868896000063614, "std_ttft_ms": 10.45418108071603, "p99_ttft_ms": 98.69780922984769, "mean_tpot_ms": 13.710971568239172, "median_tpot_ms": 13.686303201518745, "std_tpot_ms": 0.9966270621573491, "p99_tpot_ms": 15.793261194984588, "mean_itl_ms": 13.73187330695019, "median_itl_ms": 13.098381999952835, "std_itl_ms": 4.023339545674027, "p99_itl_ms": 34.48662216016601} <<<RESULT_END>>> Appended results block for 10 QPS Cleaning up Job vllm-bench-job-10qps... job.batch "vllm-bench-job-10qps" deleted π Launching vllm-bench-job-30qps (QPS=30, prompts=900)β¦ job.batch/vllm-bench-job-30qps created job.batch/vllm-bench-job-30qps condition met π Logs from vllm-bench-job-30qps: Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN Starting benchmark at Fri May 30 19:47:42 UTC 2025 ----- ENV VARS ----- BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 MODEL=meta-llama/Llama-3.2-3B-Instruct DATASET_NAME=random RANDOM_INPUT_LEN=1000 RANDOM_OUTPUT_LEN=500 REQUEST_RATE=30 NUM_PROMPTS=900 IGNORE_EOS=true RESULT_FILENAME=results.json METADATA=deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 30 --num-prompts 900 --save-result --result-filename results.json --metadata deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 19:47:51 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 19:47:51 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 19:47:52 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=30.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=no-features', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 30.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 900/900 [00:40<00:00, 22.10it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  40.72
Total input tokens:                      899100
Total generated tokens:                  237873
Request throughput (req/s):              22.10
Output token throughput (tok/s):         5841.36
Total Token throughput (tok/s):          27920.21
---------------Time to First Token----------------
Mean TTFT (ms):                          80.19
Median TTFT (ms):                        76.65
P99 TTFT (ms):                           171.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.94
Median TPOT (ms):                        25.47
P99 TPOT (ms):                           35.58
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.81
Median ITL (ms):                         21.83
P99 ITL (ms):                            74.24
<<<RESULT_START>>> {"date": "20250530-194848", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "no-features", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.722225800999695, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 237873, "request_throughput": 22.100953037245468, "request_goodput:": null, "output_throughput": 5841.355557587435, "total_token_throughput": 27920.20764179566, "mean_ttft_ms": 80.19281090889662, "median_ttft_ms": 76.65224249990388, "std_ttft_ms": 26.85547410843286, "p99_ttft_ms": 171.67865642984907, "mean_tpot_ms": 24.942996925599196, "median_tpot_ms": 25.466091678535058, "std_tpot_ms": 5.279276119470118, "p99_tpot_ms": 35.581390495641394, "mean_itl_ms": 24.814122015140974, "median_itl_ms": 21.825537999575317, "std_itl_ms": 11.660791168594844, "p99_itl_ms": 74.24358160013071} <<<RESULT_END>>> Appended results block for 30 QPS Cleaning up Job vllm-bench-job-30qps... job.batch "vllm-bench-job-30qps" deleted π Launching vllm-bench-job-inf (infinite QPS, prompts=900)β¦ job.batch/vllm-bench-job-inf created job.batch/vllm-bench-job-inf condition met π Logs from vllm-bench-job-inf: Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN Starting benchmark at Fri May 30 19:48:52 UTC 2025 ----- ENV VARS ----- BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 MODEL=meta-llama/Llama-3.2-3B-Instruct DATASET_NAME=random RANDOM_INPUT_LEN=1000 RANDOM_OUTPUT_LEN=500 REQUEST_RATE=inf NUM_PROMPTS=900 IGNORE_EOS=true RESULT_FILENAME=results.json METADATA=deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate inf --num-prompts 900 --save-result --result-filename results.json --metadata deployment=no-features gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 19:49:02 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 19:49:02 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 19:49:03 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=no-features', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 900/900 [00:28<00:00, 31.18it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  28.86
Total input tokens:                      899100
Total generated tokens:                  240165
Request throughput (req/s):              31.18
Output token throughput (tok/s):         8321.23
Total Token throughput (tok/s):          39473.22
---------------Time to First Token----------------
Mean TTFT (ms):                          4386.74
Median TTFT (ms):                        4066.16
P99 TTFT (ms):                           9951.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          61.81
Median TPOT (ms):                        53.14
P99 TPOT (ms):                           118.56
---------------Inter-token Latency----------------
Mean ITL (ms):                           48.17
Median ITL (ms):                         37.26
P99 ITL (ms):                            134.97
<<<RESULT_START>>> {"date": "20250530-194947", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "no-features", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 28.861716763000004, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240165, "request_throughput": 31.183176225808488, "request_goodput:": null, "output_throughput": 8321.230575856995, "total_token_throughput": 39473.22362543967, "mean_ttft_ms": 4386.739448253334, "median_ttft_ms": 4066.1594384998807, "std_ttft_ms": 2469.091218460141, "p99_ttft_ms": 9951.535929819987, "mean_tpot_ms": 61.81461881712581, "median_tpot_ms": 53.13592476507745, "std_tpot_ms": 22.250639166144722, "p99_tpot_ms": 118.56057960224177, "mean_itl_ms": 48.16794500382424, "median_itl_ms": 37.264703000346344, "std_itl_ms": 26.871669857570474, "p99_itl_ms": 134.96996935997225} <<<RESULT_END>>> Appended results block for infinite QPS Cleaning up Job vllm-bench-job-inf... job.batch "vllm-bench-job-inf" deleted β All benchmarks complete. Combined results in ./results.json Benchmark for no-features completed.
--- Uninstalling LLM Deployment: no-features ---
🚀 EXEC: ./llmd-installer.sh --minikube --uninstall
ℹ️ Setting up script environment...
ℹ️ kubectl can reach a running Kubernetes cluster.
ℹ️ 🗑️ Tearing down GAIE Kubernetes infrastructure…
✅ Base CRDs: Deleting...
customresourcedefinition.apiextensions.k8s.io "gatewayclasses.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "gateways.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "grpcroutes.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "httproutes.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "referencegrants.gateway.networking.k8s.io" deleted
✅ GAIE CRDs: Deleting...
customresourcedefinition.apiextensions.k8s.io "inferencemodels.inference.networking.x-k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "inferencepools.inference.networking.x-k8s.io" deleted
✅ Gateway provider 'kgateway': Deleting...
release "kgateway" uninstalled
release "kgateway-crds" uninstalled
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
ℹ️ 🗑️ Uninstalling llm-d chart...
release "llm-d" uninstalled
ℹ️ 🗑️ Deleting namespace llm-d...
namespace "llm-d" deleted
ℹ️ 🗑️ Deleting monitoring namespace...
ℹ️ 🗑️ Deleting Minikube hostPath PV (model-hostpath-pv)...
ℹ️ 🗑️ Deleting ClusterRoleBinding llm-d
No resources found
✅ 🎉 Uninstallation complete
Uninstallation for no-features completed.
Pausing for a few seconds before next deployment...
--- Installing LLM Deployment: base (using examples/base/base.yaml) ---
🚀 EXEC: ./llmd-installer.sh --minikube --values-file examples/base/base.yaml --disable-metrics-collection
ℹ️ Setting up script environment...
ℹ️ kubectl can reach a running Kubernetes cluster.
✅ HF_TOKEN validated
ℹ️ 🏗️ Installing GAIE Kubernetes infrastructure…
✅ Base CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/gatewayclasses.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/gateways.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/grpcroutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/httproutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/referencegrants.gateway.networking.k8s.io created
✅ GAIE CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/inferencemodels.inference.networking.x-k8s.io created
customresourcedefinition.apiextensions.k8s.io/inferencepools.inference.networking.x-k8s.io created
✅ Gateway provider 'kgateway': Installing...
Release "kgateway-crds" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
NAME: kgateway-crds
LAST DEPLOYED: Fri May 30 19:50:09 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Release "kgateway" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway:v2.0.0
Digest: sha256:bbd7559eaa05ef6c27382390768889f5475e75bdcb4bd81ebd0f770cd14ab7a8
NAME: kgateway
LAST DEPLOYED: Fri May 30 19:50:10 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
✅ GAIE infra applied
ℹ️ 📦 Creating namespace llm-d...
namespace/llm-d created
Context "minikube" modified.
✅ Namespace ready
ℹ️ Using merged values: /tmp/tmp.ECcof8e3Jm
ℹ️ Creating/updating HF token secret...
secret/llm-d-hf-token created
✅ HF token secret created
ℹ️ Fetching OCP proxy UID...
ℹ️ No OpenShift SCC annotation found; defaulting PROXY_UID=0
ℹ️ Applying modelservice CRD...
customresourcedefinition.apiextensions.k8s.io/modelservices.llm-d.ai unchanged
✅ ModelService CRD applied
ℹ️ Model download to PVC skipped: BYO model via HF repo_id selected.
protocol hf chosen - models will be downloaded JIT in inferencing pods.
"bitnami" already exists with the same configuration, skipping
ℹ️ 🛠️ Building Helm chart dependencies...
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading common from repo https://charts.bitnami.com/bitnami
Downloading redis from repo https://charts.bitnami.com/bitnami
Pulled: registry-1.docker.io/bitnamicharts/redis:20.13.4
Digest: sha256:6a389e13237e8e639ec0d445e785aa246b57bfce711b087033a196a291d5c8d7
Deleting outdated charts
✅ Dependencies built
ℹ️ Metrics collection disabled by user request
ℹ️ Metrics collection disabled by user request
ℹ️ Metrics collection disabled by user request
ℹ️ 🚀 Deploying llm-d chart with /tmp/tmp.ECcof8e3Jm...
Release "llm-d" does not exist. Installing it now.
NAME: llm-d
LAST DEPLOYED: Fri May 30 19:50:18 2025
NAMESPACE: llm-d
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing llm-d.
Your release is named `llm-d`.
To learn more about the release, try:
$ helm status llm-d
$ helm get all llm-d
Following presets are available to your users:

| Name | Description |
|---|---|
| basic-gpu-preset | Basic gpu inference |
| basic-gpu-with-nixl-preset | GPU inference with NIXL P/D KV transfer and cache offloading |
| basic-gpu-with-nixl-and-redis-lookup-preset | GPU inference with NIXL P/D KV transfer, cache offloading and Redis lookup server |
| basic-sim-preset | Basic simulation |

✅ llm-d deployed
✅ 🎉 Installation complete.
Installation command for base sent.
--- Waiting for vLLM instance for deployment 'base' to initialize ---
ℹ️ This step can take some time, as it may involve downloading large model files and then initializing the vLLM engine.
Attempt 1/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
Attempt 2/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
Attempt 3/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
Attempt 4/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
Attempt 5/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
Attempt 6/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
Attempt 7/60: Waiting for 'base' to be ready. Retrying in 30 seconds...
✅ vLLM instance for deployment 'base' is ready!
--- Running Benchmark for: base (Model: meta-llama/Llama-3.2-3B-Instruct, Prefill: 0, Decode: 4, Input: 1000, Output: 500) ---
Metadata: deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Result File: results.json
🚀 EXEC: ./run-bench.sh --model meta-llama/Llama-3.2-3B-Instruct --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --dataset-name random --input-len 1000 --output-len 500 --request-rates 10,30,inf --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500 --result-file results.json
secret/hf-token-secret created
▶️ Benchmarking MODEL=meta-llama/Llama-3.2-3B-Instruct at rates: 10 30 inf QPS for 30 seconds each
Results will go into ./results.json
🚀 Launching vllm-bench-job-10qps (QPS=10, prompts=300)…
job.batch/vllm-bench-job-10qps created
job.batch/vllm-bench-job-10qps condition met
Logs from vllm-bench-job-10qps:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 19:58:22 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=10
NUM_PROMPTS=300
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 10 --num-prompts 300 --save-result --result-filename results.json --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 19:58:31 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 19:58:31 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 19:58:32 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=300, logprobs=None, request_rate=10.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=base', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 300/300 [00:36<00:00, 8.29it/s]
============ Serving Benchmark Result ============
Successful requests:                     300
Benchmark duration (s):                  36.20
Total input tokens:                      299700
Total generated tokens:                  82608
Request throughput (req/s):              8.29
Output token throughput (tok/s):         2282.27
Total Token throughput (tok/s):          10562.32
---------------Time to First Token----------------
Mean TTFT (ms):                          54.59
Median TTFT (ms):                        52.79
P99 TTFT (ms):                           89.80
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.81
Median TPOT (ms):                        13.62
P99 TPOT (ms):                           17.14
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.79
Median ITL (ms):                         13.05
P99 ITL (ms):                            34.87
<<<RESULT_START>>> {"date": "20250530-195922", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 36.19546746500009, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 82608, "request_throughput": 8.288330584211705, "request_goodput:": null, "output_throughput": 2282.2747096685353, "total_token_throughput": 10562.316963296029, "mean_ttft_ms": 54.592403276680365, "median_ttft_ms": 52.79431650001243, "std_ttft_ms": 9.218280749895815, "p99_ttft_ms": 89.80040712966002, "mean_tpot_ms": 13.81029661989983, "median_tpot_ms": 13.619587584448205, "std_tpot_ms": 1.1567995163110574, "p99_tpot_ms": 17.1377381639993, "mean_itl_ms": 13.787254072508166, "median_itl_ms": 13.049414999841247, "std_itl_ms": 4.111252531976249, "p99_itl_ms": 34.86666600036186} <<<RESULT_END>>> Appended results block for 10 QPS Cleaning up Job vllm-bench-job-10qps... job.batch "vllm-bench-job-10qps" deleted π Launching vllm-bench-job-30qps (QPS=30, prompts=900)β¦ job.batch/vllm-bench-job-30qps created job.batch/vllm-bench-job-30qps condition met π Logs from vllm-bench-job-30qps: Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN Starting benchmark at Fri May 30 19:59:26 UTC 2025 ----- ENV VARS ----- BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 MODEL=meta-llama/Llama-3.2-3B-Instruct DATASET_NAME=random RANDOM_INPUT_LEN=1000 RANDOM_OUTPUT_LEN=500 REQUEST_RATE=30 NUM_PROMPTS=900 IGNORE_EOS=true RESULT_FILENAME=results.json METADATA=deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 30 --num-prompts 900 --save-result --result-filename results.json --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 19:59:35 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 19:59:35 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 19:59:37 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=30.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=base', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 30.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 900/900 [00:40<00:00, 22.07it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  40.77
Total input tokens:                      899100
Total generated tokens:                  240128
Request throughput (req/s):              22.07
Output token throughput (tok/s):         5889.41
Total Token throughput (tok/s):          27940.84
---------------Time to First Token----------------
Mean TTFT (ms):                          73.28
Median TTFT (ms):                        72.59
P99 TTFT (ms):                           182.99
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.98
Median TPOT (ms):                        24.20
P99 TPOT (ms):                           35.32
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.91
Median ITL (ms):                         21.30
P99 ITL (ms):                            75.28
<<<RESULT_START>>> {"date": "20250530-200032", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.77285659999961, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240128, "request_throughput": 22.073508580215805, "request_goodput:": null, "output_throughput": 5889.408298166733, "total_token_throughput": 27940.843369802322, "mean_ttft_ms": 73.27519322778042, "median_ttft_ms": 72.58796349969998, "std_ttft_ms": 34.88893424361722, "p99_ttft_ms": 182.99465706019868, "mean_tpot_ms": 23.98439489090563, "median_tpot_ms": 24.200584621242403, "std_tpot_ms": 6.2426528186199635, "p99_tpot_ms": 35.31807143475544, "mean_itl_ms": 23.91148948750557, "median_itl_ms": 21.303171499766904, "std_itl_ms": 11.541504488505524, "p99_itl_ms": 75.27688499983014} <<<RESULT_END>>> Appended results block for 30 QPS Cleaning up Job vllm-bench-job-30qps... job.batch "vllm-bench-job-30qps" deleted π Launching vllm-bench-job-inf (infinite QPS, prompts=900)β¦ job.batch/vllm-bench-job-inf created job.batch/vllm-bench-job-inf condition met π Logs from vllm-bench-job-inf: Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN Starting benchmark at Fri May 30 20:00:36 UTC 2025 ----- ENV VARS ----- BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 MODEL=meta-llama/Llama-3.2-3B-Instruct DATASET_NAME=random RANDOM_INPUT_LEN=1000 RANDOM_OUTPUT_LEN=500 REQUEST_RATE=inf NUM_PROMPTS=900 IGNORE_EOS=true RESULT_FILENAME=results.json METADATA=deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate inf --num-prompts 900 --save-result --result-filename results.json --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:00:46 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:00:46 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:00:47 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=base', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 900/900 [00:23<00:00, 37.76it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  23.83
Total input tokens:                      899100
Total generated tokens:                  241786
Request throughput (req/s):              37.76
Output token throughput (tok/s):         10145.34
Total Token throughput (tok/s):          47871.58
---------------Time to First Token----------------
Mean TTFT (ms):                          1253.67
Median TTFT (ms):                        937.95
P99 TTFT (ms):                           3155.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.33
Median TPOT (ms):                        43.75
P99 TPOT (ms):                           97.13
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.97
Median ITL (ms):                         37.63
P99 ITL (ms):                            86.79
<<<RESULT_START>>> {"date": "20250530-200126", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 23.8322206619996, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 241786, "request_throughput": 37.76400079389359, "request_goodput:": null, "output_throughput": 10145.340773280393, "total_token_throughput": 47871.57756638009, "mean_ttft_ms": 1253.6749797966502, "median_ttft_ms": 937.9455774997041, "std_ttft_ms": 750.4385642809029, "p99_ttft_ms": 3155.812861540362, "mean_tpot_ms": 46.333835858714664, "median_tpot_ms": 43.749258578274386, "std_tpot_ms": 12.805014931694936, "p99_tpot_ms": 97.13205872092182, "mean_itl_ms": 40.968254144283954, "median_itl_ms": 37.628544499966665, "std_itl_ms": 11.948914401781659, "p99_itl_ms": 86.78916550024948} <<<RESULT_END>>> Appended results block for infinite QPS Cleaning up Job vllm-bench-job-inf... job.batch "vllm-bench-job-inf" deleted β All benchmarks complete. Combined results in ./results.json Benchmark for base completed.
--- Uninstalling LLM Deployment: base ---
🚀 EXEC: ./llmd-installer.sh --minikube --uninstall
ℹ️ Setting up script environment...
ℹ️ kubectl can reach a running Kubernetes cluster.
ℹ️ 🗑️ Tearing down GAIE Kubernetes infrastructure…
✅ Base CRDs: Deleting...
customresourcedefinition.apiextensions.k8s.io "gatewayclasses.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "gateways.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "grpcroutes.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "httproutes.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "referencegrants.gateway.networking.k8s.io" deleted
✅ GAIE CRDs: Deleting...
customresourcedefinition.apiextensions.k8s.io "inferencemodels.inference.networking.x-k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "inferencepools.inference.networking.x-k8s.io" deleted
✅ Gateway provider 'kgateway': Deleting...
release "kgateway" uninstalled
release "kgateway-crds" uninstalled
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
ℹ️ 🗑️ Uninstalling llm-d chart...
release "llm-d" uninstalled
ℹ️ 🗑️ Deleting namespace llm-d...
namespace "llm-d" deleted
ℹ️ 🗑️ Deleting monitoring namespace...
ℹ️ 🗑️ Deleting Minikube hostPath PV (model-hostpath-pv)...
ℹ️ 🗑️ Deleting ClusterRoleBinding llm-d
No resources found
✅ 🎉 Uninstallation complete
Uninstallation for base completed.
Pausing for a few seconds before next deployment...
--- Installing LLM Deployment: kvcache (using examples/kvcache/kvcache.yaml) ---
🚀 EXEC: ./llmd-installer.sh --minikube --values-file examples/kvcache/kvcache.yaml --disable-metrics-collection
ℹ️ Setting up script environment...
ℹ️ kubectl can reach a running Kubernetes cluster.
✅ HF_TOKEN validated
ℹ️ 🏗️ Installing GAIE Kubernetes infrastructure…
✅ Base CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/gatewayclasses.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/gateways.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/grpcroutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/httproutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/referencegrants.gateway.networking.k8s.io created
✅ GAIE CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/inferencemodels.inference.networking.x-k8s.io created
customresourcedefinition.apiextensions.k8s.io/inferencepools.inference.networking.x-k8s.io created
✅ Gateway provider 'kgateway': Installing...
Release "kgateway-crds" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
NAME: kgateway-crds
LAST DEPLOYED: Fri May 30 20:01:46 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Release "kgateway" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway:v2.0.0
Digest: sha256:bbd7559eaa05ef6c27382390768889f5475e75bdcb4bd81ebd0f770cd14ab7a8
NAME: kgateway
LAST DEPLOYED: Fri May 30 20:01:48 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
✅ GAIE infra applied
ℹ️ 📦 Creating namespace llm-d...
namespace/llm-d created
Context "minikube" modified.
✅ Namespace ready
ℹ️ Using merged values: /tmp/tmp.xLX5pNIysj
ℹ️ Creating/updating HF token secret...
secret/llm-d-hf-token created
✅ HF token secret created
ℹ️ Fetching OCP proxy UID...
ℹ️ No OpenShift SCC annotation found; defaulting PROXY_UID=0
ℹ️ Applying modelservice CRD...
customresourcedefinition.apiextensions.k8s.io/modelservices.llm-d.ai unchanged
✅ ModelService CRD applied
ℹ️ Model download to PVC skipped: BYO model via HF repo_id selected.
protocol hf chosen - models will be downloaded JIT in inferencing pods.
"bitnami" already exists with the same configuration, skipping
ℹ️ 🛠️ Building Helm chart dependencies...
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading common from repo https://charts.bitnami.com/bitnami
Downloading redis from repo https://charts.bitnami.com/bitnami
Pulled: registry-1.docker.io/bitnamicharts/redis:20.13.4
Digest: sha256:6a389e13237e8e639ec0d445e785aa246b57bfce711b087033a196a291d5c8d7
Deleting outdated charts
✅ Dependencies built
ℹ️ Metrics collection disabled by user request
ℹ️ Metrics collection disabled by user request
ℹ️ Metrics collection disabled by user request
ℹ️ 🚀 Deploying llm-d chart with /tmp/tmp.xLX5pNIysj...
Release "llm-d" does not exist. Installing it now.
NAME: llm-d
LAST DEPLOYED: Fri May 30 20:01:56 2025
NAMESPACE: llm-d
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing llm-d.
Your release is named `llm-d`.
To learn more about the release, try:
$ helm status llm-d
$ helm get all llm-d
Following presets are available to your users:

| Name | Description |
|---|---|
| basic-gpu-preset | Basic gpu inference |
| basic-gpu-with-nixl-preset | GPU inference with NIXL P/D KV transfer and cache offloading |
| basic-gpu-with-nixl-and-redis-lookup-preset | GPU inference with NIXL P/D KV transfer, cache offloading and Redis lookup server |
| basic-sim-preset | Basic simulation |

✅ llm-d deployed
✅ 🎉 Installation complete.
Installation command for kvcache sent.
--- Waiting for vLLM instance for deployment 'kvcache' to initialize ---
ℹ️ This step can take some time, as it may involve downloading large model files and then initializing the vLLM engine.
Attempt 1/60: Waiting for 'kvcache' to be ready. Retrying in 30 seconds...
Attempt 2/60: Waiting for 'kvcache' to be ready. Retrying in 30 seconds...
Attempt 3/60: Waiting for 'kvcache' to be ready. Retrying in 30 seconds...
Attempt 4/60: Waiting for 'kvcache' to be ready. Retrying in 30 seconds...
Attempt 5/60: Waiting for 'kvcache' to be ready. Retrying in 30 seconds...
Attempt 6/60: Waiting for 'kvcache' to be ready. Retrying in 30 seconds...
✅ vLLM instance for deployment 'kvcache' is ready!
--- Running Benchmark for: kvcache (Model: meta-llama/Llama-3.2-3B-Instruct, Prefill: 0, Decode: 4, Input: 1000, Output: 500) ---
Metadata: deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Result File: results.json
π EXEC: ./run-bench.sh --model meta-llama/Llama-3.2-3B-Instruct --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --dataset-name random --input-len 1000 --output-len 500 --request-rates 10,30,inf --metadata deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500 --result-file results.json
secret/hf-token-secret created
βΆοΈ Benchmarking MODEL=meta-llama/Llama-3.2-3B-Instruct at rates: 10 30 inf QPS for 30 seconds each
π Results will go into ./results.json
π Launching vllm-bench-job-10qps (QPS=10, prompts=300)β¦
job.batch/vllm-bench-job-10qps created
job.batch/vllm-bench-job-10qps condition met
π Logs from vllm-bench-job-10qps:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 20:10:01 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=10
NUM_PROMPTS=300
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 10 --num-prompts 300 --save-result --result-filename results.json --metadata deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:10:11 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:10:11 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:10:12 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=300, logprobs=None, request_rate=10.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=kvcache', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|ββββββββββ| 300/300 [00:35<00:00, 8.34it/s]
============ Serving Benchmark Result ============
Successful requests:                     300
Benchmark duration (s):                  35.98
Total input tokens:                      299700
Total generated tokens:                  81393
Request throughput (req/s):              8.34
Output token throughput (tok/s):         2262.19
Total Token throughput (tok/s):          10591.88
---------------Time to First Token----------------
Mean TTFT (ms):                          53.96
Median TTFT (ms):                        52.87
P99 TTFT (ms):                           87.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.56
Median TPOT (ms):                        13.46
P99 TPOT (ms):                           15.65
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.50
Median ITL (ms):                         12.90
P99 ITL (ms):                            34.21
<<<RESULT_START>>>
{"date": "20250530-201101", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "kvcache", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 35.97974775400053, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 81393, "request_throughput": 8.338023992028779, "request_goodput:": null, "output_throughput": 2262.189289277328, "total_token_throughput": 10591.87525731408, "mean_ttft_ms": 53.96254838997568, "median_ttft_ms": 52.86790049967749, "std_ttft_ms": 8.636840862111338, "p99_ttft_ms": 87.51494646001444, "mean_tpot_ms": 13.562497126015346, "median_tpot_ms": 13.460283402405013, "std_tpot_ms": 1.0343708502449698, "p99_tpot_ms": 15.645541584462748, "mean_itl_ms": 13.503936382745758, "median_itl_ms": 12.901992000479368, "std_itl_ms": 3.8045600847990464, "p99_itl_ms": 34.20889711982454}
<<<RESULT_END>>>
Appended results block for 10 QPS
Cleaning up Job vllm-bench-job-10qps...
job.batch "vllm-bench-job-10qps" deleted
π Launching vllm-bench-job-30qps (QPS=30, prompts=900)β¦
job.batch/vllm-bench-job-30qps created
job.batch/vllm-bench-job-30qps condition met
π Logs from vllm-bench-job-30qps:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 20:11:06 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=30
NUM_PROMPTS=900
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 30 --num-prompts 900 --save-result --result-filename results.json --metadata deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:11:15 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:11:15 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:11:16 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=30.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=kvcache', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 30.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|ββββββββββ| 900/900 [00:40<00:00, 21.99it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  40.93
Total input tokens:                      899100
Total generated tokens:                  237626
Request throughput (req/s):              21.99
Output token throughput (tok/s):         5806.21
Total Token throughput (tok/s):          27775.05
---------------Time to First Token----------------
Mean TTFT (ms):                          72.16
Median TTFT (ms):                        73.32
P99 TTFT (ms):                           168.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.54
Median TPOT (ms):                        24.67
P99 TPOT (ms):                           32.86
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.37
Median ITL (ms):                         21.04
P99 ITL (ms):                            75.26
<<<RESULT_START>>>
{"date": "20250530-201214", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "kvcache", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.92615340700013, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 237626, "request_throughput": 21.99082799328171, "request_goodput:": null, "output_throughput": 5806.213880812844, "total_token_throughput": 27775.05104610127, "mean_ttft_ms": 72.16499725443909, "median_ttft_ms": 73.31621500043184, "std_ttft_ms": 32.75126651570117, "p99_ttft_ms": 168.3847288293964, "mean_tpot_ms": 23.541025315462797, "median_tpot_ms": 24.670025282634825, "std_tpot_ms": 5.525520582614772, "p99_tpot_ms": 32.858918094328644, "mean_itl_ms": 23.368492361219396, "median_itl_ms": 21.040741000433627, "std_itl_ms": 10.963675513599902, "p99_itl_ms": 75.25774875034585}
<<<RESULT_END>>>
Appended results block for 30 QPS
Cleaning up Job vllm-bench-job-30qps...
job.batch "vllm-bench-job-30qps" deleted
π Launching vllm-bench-job-inf (infinite QPS, prompts=900)β¦
job.batch/vllm-bench-job-inf created
job.batch/vllm-bench-job-inf condition met
π Logs from vllm-bench-job-inf:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 20:12:17 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=inf
NUM_PROMPTS=900
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate inf --num-prompts 900 --save-result --result-filename results.json --metadata deployment=kvcache gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:12:26 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:12:26 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:12:27 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=kvcache', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|ββββββββββ| 900/900 [00:21<00:00, 41.22it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  21.84
Total input tokens:                      899100
Total generated tokens:                  240439
Request throughput (req/s):              41.22
Output token throughput (tok/s):         11011.35
Total Token throughput (tok/s):          52187.30
---------------Time to First Token----------------
Mean TTFT (ms):                          884.96
Median TTFT (ms):                        867.67
P99 TTFT (ms):                           1249.64
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.02
Median TPOT (ms):                        44.02
P99 TPOT (ms):                           71.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.44
Median ITL (ms):                         37.50
P99 ITL (ms):                            67.52
<<<RESULT_START>>>
{"date": "20250530-201304", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "kvcache", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 21.835562008000124, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240439, "request_throughput": 41.21716673334341, "request_goodput:": null, "output_throughput": 11011.349280220396, "total_token_throughput": 52187.29884683046, "mean_ttft_ms": 884.9588357166714, "median_ttft_ms": 867.6726769999732, "std_ttft_ms": 164.40779421335031, "p99_ttft_ms": 1249.6421772296253, "mean_tpot_ms": 45.0189959701786, "median_tpot_ms": 44.02160330742376, "std_tpot_ms": 8.692672848556585, "p99_tpot_ms": 71.62702592521168, "mean_itl_ms": 40.442648903097236, "median_itl_ms": 37.49666700059606, "std_itl_ms": 9.076602885769773, "p99_itl_ms": 67.51899562010284}
<<<RESULT_END>>>
Appended results block for infinite QPS
Cleaning up Job vllm-bench-job-inf...
job.batch "vllm-bench-job-inf" deleted
β All benchmarks complete. Combined results in ./results.json
Benchmark for kvcache completed.
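Note the <<<RESULT_START>>>/<<<RESULT_END>>> sentinels in the job logs: they let the driver pull one structured JSON object out of otherwise noisy output. A minimal sketch of that extraction (illustrative only; run-bench.sh's actual implementation may differ):

```python
import json
import re

RESULT_RE = re.compile(r"<<<RESULT_START>>>(.*?)<<<RESULT_END>>>", re.DOTALL)

def extract_results(log_text: str) -> list[dict]:
    """Return every JSON result object bracketed by the sentinels in a job log.
    Each object is then appended as one line of the combined results file."""
    return [json.loads(chunk.strip()) for chunk in RESULT_RE.findall(log_text)]
```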
--- Uninstalling LLM Deployment: kvcache ---
π EXEC: ./llmd-installer.sh --minikube --uninstall
βΉοΈ π Setting up script environment...
βΉοΈ kubectl can reach to a running Kubernetes cluster.
βΉοΈ ποΈ Tearing down GAIE Kubernetes infrastructureβ¦
β π Base CRDs: Deleting...
customresourcedefinition.apiextensions.k8s.io "gatewayclasses.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "gateways.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "grpcroutes.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "httproutes.gateway.networking.k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "referencegrants.gateway.networking.k8s.io" deleted
β πͺ GAIE CRDs: Deleting...
customresourcedefinition.apiextensions.k8s.io "inferencemodels.inference.networking.x-k8s.io" deleted
customresourcedefinition.apiextensions.k8s.io "inferencepools.inference.networking.x-k8s.io" deleted
β π Gateway provider 'kgateway': Deleting...
release "kgateway" uninstalled
release "kgateway-crds" uninstalled
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
βΉοΈ ποΈ Uninstalling llm-d chart...
release "llm-d" uninstalled
βΉοΈ ποΈ Deleting namespace llm-d...
namespace "llm-d" deleted
βΉοΈ ποΈ Deleting monitoring namespace...
βΉοΈ ποΈ Deleting Minikube hostPath PV (model-hostpath-pv)...
βΉοΈ ποΈ Deleting ClusterRoleBinding llm-d
No resources found
β π Uninstallation complete
Uninstallation for kvcache completed.
π All configured deployments processed.
========= Full Deployment and Benchmark Process Finished =========
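Each run above logs "Burstiness factor: 1.0 (Poisson process)": at burstiness 1.0 the benchmark spaces requests with exponentially distributed gaps of mean 1/rate, so 10 QPS is an average, not a fixed cadence. A sketch of how such an arrival schedule can be generated (illustrative; not benchmark_serving.py's actual code):

```python
import random

def arrival_offsets(request_rate: float, num_prompts: int, seed: int = 0) -> list[float]:
    """Second offsets at which each request fires. Exponential gaps with mean
    1/request_rate give a Poisson arrival process; an infinite rate collapses
    the whole batch to t=0, which is the 'inf' case in the runs above."""
    rng = random.Random(seed)
    t, offsets = 0.0, []
    for _ in range(num_prompts):
        offsets.append(t)
        if request_rate != float("inf"):
            t += rng.expovariate(request_rate)
    return offsets

schedule = arrival_offsets(10.0, 300)  # ~10 requests/s on average, as in the 10 QPS runs
```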
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ cat results.json
{"date": "20250530-194738", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "no-features", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 35.84472489100017, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 80945, "request_throughput": 8.369432347779671, "request_goodput:": null, "output_throughput": 2258.212337970085, "total_token_throughput": 10619.275253401978, "mean_ttft_ms": 54.59561144333596, "median_ttft_ms": 52.868896000063614, "std_ttft_ms": 10.45418108071603, "p99_ttft_ms": 98.69780922984769, "mean_tpot_ms": 13.710971568239172, "median_tpot_ms": 13.686303201518745, "std_tpot_ms": 0.9966270621573491, "p99_tpot_ms": 15.793261194984588, "mean_itl_ms": 13.73187330695019, "median_itl_ms": 13.098381999952835, "std_itl_ms": 4.023339545674027, "p99_itl_ms": 34.48662216016601}
{"date": "20250530-194848", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "no-features", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.722225800999695, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 237873, "request_throughput": 22.100953037245468, "request_goodput:": null, "output_throughput": 5841.355557587435, "total_token_throughput": 27920.20764179566, "mean_ttft_ms": 80.19281090889662, "median_ttft_ms": 76.65224249990388, "std_ttft_ms": 26.85547410843286, "p99_ttft_ms": 171.67865642984907, "mean_tpot_ms": 24.942996925599196, "median_tpot_ms": 25.466091678535058, "std_tpot_ms": 5.279276119470118, "p99_tpot_ms": 35.581390495641394, "mean_itl_ms": 24.814122015140974, "median_itl_ms": 21.825537999575317, "std_itl_ms": 11.660791168594844, "p99_itl_ms": 74.24358160013071}
{"date": "20250530-194947", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "no-features", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 28.861716763000004, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240165, "request_throughput": 31.183176225808488, "request_goodput:": null, "output_throughput": 8321.230575856995, "total_token_throughput": 39473.22362543967, "mean_ttft_ms": 4386.739448253334, "median_ttft_ms": 4066.1594384998807, "std_ttft_ms": 2469.091218460141, "p99_ttft_ms": 9951.535929819987, "mean_tpot_ms": 61.81461881712581, "median_tpot_ms": 53.13592476507745, "std_tpot_ms": 22.250639166144722, "p99_tpot_ms": 118.56057960224177, "mean_itl_ms": 48.16794500382424, "median_itl_ms": 37.264703000346344, "std_itl_ms": 26.871669857570474, "p99_itl_ms": 134.96996935997225}
{"date": "20250530-195922", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 36.19546746500009, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 82608, "request_throughput": 8.288330584211705, "request_goodput:": null, "output_throughput": 2282.2747096685353, "total_token_throughput": 10562.316963296029, "mean_ttft_ms": 54.592403276680365, "median_ttft_ms": 52.79431650001243, "std_ttft_ms": 9.218280749895815, "p99_ttft_ms": 89.80040712966002, "mean_tpot_ms": 13.81029661989983, "median_tpot_ms": 13.619587584448205, "std_tpot_ms": 1.1567995163110574, "p99_tpot_ms": 17.1377381639993, "mean_itl_ms": 13.787254072508166, "median_itl_ms": 13.049414999841247, "std_itl_ms": 4.111252531976249, "p99_itl_ms": 34.86666600036186}
{"date": "20250530-200032", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.77285659999961, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240128, "request_throughput": 22.073508580215805, "request_goodput:": null, "output_throughput": 5889.408298166733, "total_token_throughput": 27940.843369802322, "mean_ttft_ms": 73.27519322778042, "median_ttft_ms": 72.58796349969998, "std_ttft_ms": 34.88893424361722, "p99_ttft_ms": 182.99465706019868, "mean_tpot_ms": 23.98439489090563, "median_tpot_ms": 24.200584621242403, "std_tpot_ms": 6.2426528186199635, "p99_tpot_ms": 35.31807143475544, "mean_itl_ms": 23.91148948750557, "median_itl_ms": 21.303171499766904, "std_itl_ms": 11.541504488505524, "p99_itl_ms": 75.27688499983014}
{"date": "20250530-200126", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 23.8322206619996, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 241786, "request_throughput": 37.76400079389359, "request_goodput:": null, "output_throughput": 10145.340773280393, "total_token_throughput": 47871.57756638009, "mean_ttft_ms": 1253.6749797966502, "median_ttft_ms": 937.9455774997041, "std_ttft_ms": 750.4385642809029, "p99_ttft_ms": 3155.812861540362, "mean_tpot_ms": 46.333835858714664, "median_tpot_ms": 43.749258578274386, "std_tpot_ms": 12.805014931694936, "p99_tpot_ms": 97.13205872092182, "mean_itl_ms": 40.968254144283954, "median_itl_ms": 37.628544499966665, "std_itl_ms": 11.948914401781659, "p99_itl_ms": 86.78916550024948}
{"date": "20250530-201101", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "kvcache", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 35.97974775400053, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 81393, "request_throughput": 8.338023992028779, "request_goodput:": null, "output_throughput": 2262.189289277328, "total_token_throughput": 10591.87525731408, "mean_ttft_ms": 53.96254838997568, "median_ttft_ms": 52.86790049967749, "std_ttft_ms": 8.636840862111338, "p99_ttft_ms": 87.51494646001444, "mean_tpot_ms": 13.562497126015346, "median_tpot_ms": 13.460283402405013, "std_tpot_ms": 1.0343708502449698, "p99_tpot_ms": 15.645541584462748, "mean_itl_ms": 13.503936382745758, "median_itl_ms": 12.901992000479368, "std_itl_ms": 3.8045600847990464, "p99_itl_ms": 34.20889711982454}
{"date": "20250530-201214", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "kvcache", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.92615340700013, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 237626, "request_throughput": 21.99082799328171, "request_goodput:": null, "output_throughput": 5806.213880812844, "total_token_throughput": 27775.05104610127, "mean_ttft_ms": 72.16499725443909, "median_ttft_ms": 73.31621500043184, "std_ttft_ms": 32.75126651570117, "p99_ttft_ms": 168.3847288293964, "mean_tpot_ms": 23.541025315462797, "median_tpot_ms": 24.670025282634825, "std_tpot_ms": 5.525520582614772, "p99_tpot_ms": 32.858918094328644, "mean_itl_ms": 23.368492361219396, "median_itl_ms": 21.040741000433627, "std_itl_ms": 10.963675513599902, "p99_itl_ms": 75.25774875034585}
{"date": "20250530-201304", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "kvcache", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 21.835562008000124, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240439, "request_throughput": 41.21716673334341, "request_goodput:": null, "output_throughput": 11011.349280220396, "total_token_throughput": 52187.29884683046, "mean_ttft_ms": 884.9588357166714, "median_ttft_ms": 867.6726769999732, "std_ttft_ms": 164.40779421335031, "p99_ttft_ms": 1249.6421772296253, "mean_tpot_ms": 45.0189959701786, "median_tpot_ms": 44.02160330742376, "std_tpot_ms": 8.692672848556585, "p99_tpot_ms": 71.62702592521168, "mean_itl_ms": 40.442648903097236, "median_itl_ms": 37.49666700059606, "std_itl_ms": 9.076602885769773, "p99_itl_ms": 67.51899562010284}
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ echo > results.json
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ls
README-minikube.md e2e-bench-control.sh grafana infra istio-test-request.sh metrics-overview.md results.json.all run-bench.sh
README.md examples grafana-setup.md install-deps.sh llmd-installer.sh results.json results.json.bak test-request.sh
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ^C
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./run-bench.sh --model meta-llama/Llama-3.2-3B-Instruct \
    --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 \
    --dataset-name random \
    --input-len 1000 \
    --output-len 500 \
    --request-rates 10,30,inf \
    --metadata "deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500" \
    --result-file results.json
namespace/llm-d created
secret/hf-token-secret created
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./llmd-installer.sh --uninstall
βΉοΈ π Setting up script environment...
βΉοΈ kubectl can reach to a running Kubernetes cluster.
βΉοΈ ποΈ Tearing down GAIE Kubernetes infrastructureβ¦
β π Base CRDs: Deleting...
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gateway-api": customresourcedefinitions.apiextensions.k8s.io "gatewayclasses.gateway.networking.k8s.io" not found
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gateway-api": customresourcedefinitions.apiextensions.k8s.io "gateways.gateway.networking.k8s.io" not found
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gateway-api": customresourcedefinitions.apiextensions.k8s.io "grpcroutes.gateway.networking.k8s.io" not found
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gateway-api": customresourcedefinitions.apiextensions.k8s.io "httproutes.gateway.networking.k8s.io" not found
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gateway-api": customresourcedefinitions.apiextensions.k8s.io "referencegrants.gateway.networking.k8s.io" not found
β πͺ GAIE CRDs: Deleting...
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gie": customresourcedefinitions.apiextensions.k8s.io "inferencemodels.inference.networking.x-k8s.io" not found
Error from server (NotFound): error when deleting "https://github.com/llm-d/llm-d-inference-scheduler/deploy/components/crds-gie": customresourcedefinitions.apiextensions.k8s.io "inferencepools.inference.networking.x-k8s.io" not found
β π Gateway provider 'kgateway': Deleting...
release "kgateway" uninstalled
release "kgateway-crds" uninstalled
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
βΉοΈ ποΈ Uninstalling llm-d chart...
release "llm-d" uninstalled
βΉοΈ ποΈ Deleting namespace llm-d...
namespace "llm-d" deleted
βΉοΈ ποΈ Deleting monitoring namespace...
βΉοΈ ποΈ Deleting ClusterRoleBinding llm-d
No resources found
β π Uninstallation complete
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ls examples/
all-features/ base/ kvcache/ llama4-fp8.yaml no-features/ pd-nixl/
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ls examples/base/
base.yaml slim/
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ls examples/base/base.yaml ^C
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ls examples/kvcache/kvcache.yaml ^C
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./llmd-installer.sh --values-file examples/kvcache/kvcache.yaml --minikube
βΉοΈ π Setting up script environment...
βΉοΈ kubectl can reach to a running Kubernetes cluster.
β HF_TOKEN validated
βΉοΈ ποΈ Installing GAIE Kubernetes infrastructureβ¦
β π Base CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/gatewayclasses.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/gateways.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/grpcroutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/httproutes.gateway.networking.k8s.io created
customresourcedefinition.apiextensions.k8s.io/referencegrants.gateway.networking.k8s.io created
β πͺ GAIE CRDs: Installing...
customresourcedefinition.apiextensions.k8s.io/inferencemodels.inference.networking.x-k8s.io created
customresourcedefinition.apiextensions.k8s.io/inferencepools.inference.networking.x-k8s.io created
β π Gateway provider 'kgateway': Installing...
Release "kgateway-crds" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway-crds:v2.0.0
Digest: sha256:b405a0fbca50ae816bba355f1133cb456f280d9925d824166b7b6fc4e96f2077
NAME: kgateway-crds
LAST DEPLOYED: Fri May 30 20:27:50 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Release "kgateway" does not exist. Installing it now.
Pulled: cr.kgateway.dev/kgateway-dev/charts/kgateway:v2.0.0
Digest: sha256:bbd7559eaa05ef6c27382390768889f5475e75bdcb4bd81ebd0f770cd14ab7a8
NAME: kgateway
LAST DEPLOYED: Fri May 30 20:27:51 2025
NAMESPACE: kgateway-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
β GAIE infra applied
βΉοΈ π¦ Creating namespace llm-d...
namespace/llm-d created
Context "minikube" modified.
β Namespace ready
βΉοΈ πΉ Using merged values: /tmp/tmp.s8hh00P0yh
βΉοΈ π Creating/updating HF token secret...
secret/llm-d-hf-token created
β HF token secret created
βΉοΈ Fetching OCP proxy UID...
βΉοΈ No OpenShift SCC annotation found; defaulting PROXY_UID=0
βΉοΈ π Applying modelservice CRD...
customresourcedefinition.apiextensions.k8s.io/modelservices.llm-d.ai unchanged
β ModelService CRD applied
βΉοΈ βοΈ Model download to PVC skipped: BYO model via HF repo_id selected.
protocol hf chosen - models will be downloaded JIT in inferencing pods.
"bitnami" already exists with the same configuration, skipping
βΉοΈ π οΈ Building Helm chart dependencies...
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. βHappy Helming!β
Saving 2 charts
Downloading common from repo https://charts.bitnami.com/bitnami
Downloading redis from repo https://charts.bitnami.com/bitnami
Pulled: registry-1.docker.io/bitnamicharts/redis:20.13.4
Digest: sha256:6a389e13237e8e639ec0d445e785aa246b57bfce711b087033a196a291d5c8d7
Deleting outdated charts
β Dependencies built
βΉοΈ π Checking for ServiceMonitor CRD (monitoring.coreos.com)...
Your release is named llm-d.
To learn more about the release, try:
$ helm status llm-d
$ helm get all llm-d
Following presets are available to your users:

Name                                        | Description
--------------------------------------------|-----------------------------------------------------------------------------------
basic-gpu-preset                            | Basic gpu inference
basic-gpu-with-nixl-preset                  | GPU inference with NIXL P/D KV transfer and cache offloading
basic-gpu-with-nixl-and-redis-lookup-preset | GPU inference with NIXL P/D KV transfer, cache offloading and Redis lookup server
basic-sim-preset                            | Basic simulation
β llm-d deployed
β π Installation complete.
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./run-bench.sh --model meta-llama/Llama-3.2-3B-Instruct --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --dataset-name random --input-len 1000 --output-len 500 --request-rates 10,30,inf --metadata "deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500" --result-file results.json ^C
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ kubectl get pods # --all-namespaces
NAME                                                       READY   STATUS    RESTARTS   AGE
llm-d-inference-gateway-5fbd8c566-htfz5                    1/1     Running   0          49s
llm-d-modelservice-5757d7b578-zgr4g                        1/1     Running   0          50s
meta-llama-llama-3-2-3b-instruct-decode-7bf457bdcc-24dc9   2/2     Running   0          48s
meta-llama-llama-3-2-3b-instruct-decode-7bf457bdcc-dxpb4   2/2     Running   0          48s
meta-llama-llama-3-2-3b-instruct-decode-7bf457bdcc-qm4gs   2/2     Running   0          48s
meta-llama-llama-3-2-3b-instruct-decode-7bf457bdcc-xpp6n   2/2     Running   0          48s
meta-llama-llama-3-2-3b-instruct-epp-555969c945-j8h74      1/1     Running   0          48s
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./test-request.sh --minikube
Namespace: llm-d
Model ID: none; will be discover from first entry in /v1/models
Minikube validation: hitting gateway DNS at llm-d-inference-gateway.llm-d.svc.cluster.local:80
1 -> GET /v1/models via DNS at llm-d-inference-gateway.llm-d.svc.cluster.local:80β¦
pod "curl-2965" deleted
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./test-request.sh --minikube
Namespace: llm-d
Model ID: none; will be discover from first entry in /v1/models
Minikube validation: hitting gateway DNS at llm-d-inference-gateway.llm-d.svc.cluster.local:80
1 -> GET /v1/models via DNS at llm-d-inference-gateway.llm-d.svc.cluster.local:80β¦
error: timed out waiting for the condition
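A timeout here usually means the decode pods are still pulling model weights rather than anything being broken. One way to confirm before retrying is to block on pod readiness first (a sketch shelling out to kubectl; the llm-d namespace matches the install above):

```python
import subprocess

# Wait up to 10 minutes for every pod in the llm-d namespace to report Ready,
# then re-run ./test-request.sh --minikube.
subprocess.run(
    ["kubectl", "wait", "--namespace", "llm-d", "--for=condition=Ready",
     "pods", "--all", "--timeout=600s"],
    check=True,
)
```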
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./llmd-installer.sh --values-file examples/kvcache/kvcache.yaml --minikube ^C
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$
ubuntu@ip-172-31-16-33:~/llm-d-deployer/quickstart$ ./run-bench.sh --model meta-llama/Llama-3.2-3B-Instruct \
    --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 \
    --dataset-name random \
    --input-len 1000 \
    --output-len 500 \
    --request-rates 10,30,inf \
    --metadata "deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500" \
    --result-file results.json
secret/hf-token-secret created
βΆοΈ Benchmarking MODEL=meta-llama/Llama-3.2-3B-Instruct at rates: 10 30 inf QPS for 30 seconds each
π Results will go into ./results.json
π Launching vllm-bench-job-10qps (QPS=10, prompts=300)β¦
job.batch/vllm-bench-job-10qps created
job.batch/vllm-bench-job-10qps condition met
π Logs from vllm-bench-job-10qps:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 20:37:06 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=10
NUM_PROMPTS=300
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 10 --num-prompts 300 --save-result --result-filename results.json --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:37:15 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:37:15 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:37:17 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=300, logprobs=None, request_rate=10.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=base', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|ββββββββββ| 300/300 [00:36<00:00, 8.33it/s]
============ Serving Benchmark Result ============
Successful requests:                     300
Benchmark duration (s):                  36.03
Total input tokens:                      299700
Total generated tokens:                  81050
Request throughput (req/s):              8.33
Output token throughput (tok/s):         2249.25
Total Token throughput (tok/s):          10566.33
---------------Time to First Token----------------
Mean TTFT (ms):                          54.58
Median TTFT (ms):                        53.00
P99 TTFT (ms):                           89.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.58
Median TPOT (ms):                        13.51
P99 TPOT (ms):                           17.65
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.51
Median ITL (ms):                         12.84
P99 ITL (ms):                            34.21
<<<RESULT_START>>>
{"date": "20250530-203806", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 300, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 10.0, "burstiness": 1.0, "max_concurrency": null, "duration": 36.034275877000255, "completed": 300, "total_input_tokens": 299700, "total_output_tokens": 81050, "request_throughput": 8.325406649602808, "request_goodput:": null, "output_throughput": 2249.247363167692, "total_token_throughput": 10566.328606120898, "mean_ttft_ms": 54.577085783294024, "median_ttft_ms": 52.999333499883505, "std_ttft_ms": 9.27197373989555, "p99_ttft_ms": 89.76647855975897, "mean_tpot_ms": 13.577995860227945, "median_tpot_ms": 13.512526767533997, "std_tpot_ms": 0.9199628548912276, "p99_tpot_ms": 17.652194109258673, "mean_itl_ms": 13.505887271504633, "median_itl_ms": 12.841573499827064, "std_itl_ms": 3.839438281316489, "p99_itl_ms": 34.210415409961556}
<<<RESULT_END>>>
Appended results block for 10 QPS
Cleaning up Job vllm-bench-job-10qps...
job.batch "vllm-bench-job-10qps" deleted
π Launching vllm-bench-job-30qps (QPS=30, prompts=900)β¦
job.batch/vllm-bench-job-30qps created
job.batch/vllm-bench-job-30qps condition met
π Logs from vllm-bench-job-30qps:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 20:38:10 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=30
NUM_PROMPTS=900
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate 30 --num-prompts 900 --save-result --result-filename results.json --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:38:19 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:38:19 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:38:20 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=30.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=base', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 30.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|ββββββββββ| 900/900 [00:40<00:00, 22.18it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  40.57
Total input tokens:                      899100
Total generated tokens:                  239959
Request throughput (req/s):              22.18
Output token throughput (tok/s):         5914.31
Total Token throughput (tok/s):          28074.59
---------------Time to First Token----------------
Mean TTFT (ms):                          73.23
Median TTFT (ms):                        74.51
P99 TTFT (ms):                           169.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.59
Median TPOT (ms):                        24.65
P99 TPOT (ms):                           32.72
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.47
Median ITL (ms):                         20.99
P99 ITL (ms):                            75.82
<<<RESULT_START>>>
{"date": "20250530-203916", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": 30.0, "burstiness": 1.0, "max_concurrency": null, "duration": 40.57259980900017, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 239959, "request_throughput": 22.182458216551215, "request_goodput:": null, "output_throughput": 5914.311656872681, "total_token_throughput": 28074.587415207345, "mean_ttft_ms": 73.22729984779572, "median_ttft_ms": 74.51430299988715, "std_ttft_ms": 33.92979932231269, "p99_ttft_ms": 169.39216242027214, "mean_tpot_ms": 23.586269942687313, "median_tpot_ms": 24.64663913131863, "std_tpot_ms": 5.4388490325372665, "p99_tpot_ms": 32.724085161084425, "mean_itl_ms": 23.466772128068868, "median_itl_ms": 20.98881700021593, "std_itl_ms": 11.048866594118087, "p99_itl_ms": 75.82270559991227}
<<<RESULT_END>>>
Appended results block for 30 QPS
Cleaning up Job vllm-bench-job-30qps...
job.batch "vllm-bench-job-30qps" deleted
π Launching vllm-bench-job-inf (infinite QPS, prompts=900)β¦
job.batch/vllm-bench-job-inf created
job.batch/vllm-bench-job-inf condition met
π Logs from vllm-bench-job-inf:
Using HF_TOKEN as HUGGINGFACE_HUB_TOKEN
Starting benchmark at Fri May 30 20:39:20 UTC 2025
----- ENV VARS -----
BASE_URL=http://llm-d-inference-gateway.llm-d.svc.cluster.local:80
MODEL=meta-llama/Llama-3.2-3B-Instruct
DATASET_NAME=random
RANDOM_INPUT_LEN=1000
RANDOM_OUTPUT_LEN=500
REQUEST_RATE=inf
NUM_PROMPTS=900
IGNORE_EOS=true
RESULT_FILENAME=results.json
METADATA=deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
Running: python /opt/benchmark/vllm/benchmarks/benchmark_serving.py --base_url http://llm-d-inference-gateway.llm-d.svc.cluster.local:80 --model meta-llama/Llama-3.2-3B-Instruct --dataset-name random --random-input-len 1000 --random-output-len 500 --request-rate inf --num-prompts 900 --save-result --result-filename results.json --metadata deployment=base gpu=4xNVIDIA_L40S model=meta-llama/Llama-3.2-3B-Instruct gateway=kgateway prefill_replicas=0 decode_replicas=4 input_len=1000 output_len=500
WARNING 05-30 20:39:29 [__init__.py:221] Platform plugin tpu function's return value is None
WARNING 05-30 20:39:30 [__init__.py:221] Platform plugin cuda function's return value is None
WARNING 05-30 20:39:30 [__init__.py:221] Platform plugin rocm function's return value is None
WARNING 05-30 20:39:30 [__init__.py:221] Platform plugin hpu function's return value is None
WARNING 05-30 20:39:30 [__init__.py:221] Platform plugin xpu function's return value is None
WARNING 05-30 20:39:30 [__init__.py:221] Platform plugin cpu function's return value is None
WARNING 05-30 20:39:30 [__init__.py:221] Platform plugin neuron function's return value is None
INFO 05-30 20:39:30 [__init__.py:250] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 05-30 20:39:31 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
Namespace(backend='vllm', base_url='http://llm-d-inference-gateway.llm-d.svc.cluster.local:80', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-3B-Instruct', tokenizer=None, use_beam_search=False, num_prompts=900, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=['deployment=base', 'gpu=4xNVIDIA_L40S', 'model=meta-llama/Llama-3.2-3B-Instruct', 'gateway=kgateway', 'prefill_replicas=0', 'decode_replicas=4', 'input_len=1000', 'output_len=500'], result_dir=None, result_filename='results.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1000, random_output_len=500, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|ββββββββββ| 900/900 [00:20<00:00, 43.02it/s]
============ Serving Benchmark Result ============
Successful requests:                     900
Benchmark duration (s):                  20.92
Total input tokens:                      899100
Total generated tokens:                  240163
Request throughput (req/s):              43.02
Output token throughput (tok/s):         11481.02
Total Token throughput (tok/s):          54462.58
---------------Time to First Token----------------
Mean TTFT (ms):                          902.25
Median TTFT (ms):                        877.08
P99 TTFT (ms):                           1353.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.36
Median TPOT (ms):                        43.74
P99 TPOT (ms):                           67.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.13
Median ITL (ms):                         36.94
P99 ITL (ms):                            66.37
<<<RESULT_START>>>
{"date": "20250530-204006", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-3B-Instruct", "tokenizer_id": "meta-llama/Llama-3.2-3B-Instruct", "num_prompts": 900, "deployment": "base", "gpu": "4xNVIDIA_L40S", "model": "meta-llama/Llama-3.2-3B-Instruct", "gateway": "kgateway", "prefill_replicas": "0", "decode_replicas": "4", "input_len": "1000", "output_len": "500", "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 20.918269691000205, "completed": 900, "total_input_tokens": 899100, "total_output_tokens": 240163, "request_throughput": 43.024591101204344, "request_goodput:": null, "output_throughput": 11481.016525153933, "total_token_throughput": 54462.58303525708, "mean_ttft_ms": 902.2531994799939, "median_ttft_ms": 877.0819514993491, "std_ttft_ms": 194.26376120632074, "p99_ttft_ms": 1353.1694669699937, "mean_tpot_ms": 44.35695594492121, "median_tpot_ms": 43.742781012684844, "std_tpot_ms": 7.88600060931539, "p99_tpot_ms": 67.41057380823649, "mean_itl_ms": 40.126101133447385, "median_itl_ms": 36.940078000043286, "std_itl_ms": 8.77480669076588, "p99_itl_ms": 66.37176537940839}
<<<RESULT_END>>>
Appended results block for infinite QPS
Cleaning up Job vllm-bench-job-inf...
job.batch "vllm-bench-job-inf" deleted
β All benchmarks complete. Combined results in ./results.json