@surajssd
Created March 5, 2025 21:54
[{"Test name": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_01", "GPU": "1xStandard_ND96asr_v4 x 2", "# of req.": 200, "Tput (req/s)": 0.9284057358744006, "Output Tput (tok/s)": 198.24247678126076, "Total Tput (tok/s)": 396.266778214591, "Mean TTFT (ms)": 110.37337160010793, "Median TTFT (ms)": 96.9816950000677, "P99 TTFT (ms)": 230.3005734290491, "Mean TPOT (ms)": 43.72182021034344, "Median TPOT (ms)": 43.54532462942404, "P99 TPOT (ms)": 50.513716590712384, "Mean ITL (ms)": 43.631314270832306, "Median ITL (ms)": 42.27557599915599, "P99 ITL (ms)": 87.99811164881247}, {"Test name": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_04", "GPU": "1xStandard_ND96asr_v4 x 2", "# of req.": 200, "Tput (req/s)": 2.521471685463534, "Output Tput (tok/s)": 539.0528242768216, "Total Tput (tok/s)": 1076.8701274277662, "Mean TTFT (ms)": 139.8380736899344, "Median TTFT (ms)": 125.15622350110789, "P99 TTFT (ms)": 332.96458055017825, "Mean TPOT (ms)": 61.62705314762229, "Median TPOT (ms)": 63.49695762410795, "P99 TPOT (ms)": 83.14804765725845, "Mean ITL (ms)": 60.90314970594733, "Median ITL (ms)": 57.82372999965446, "P99 ITL (ms)": 173.3455703589425}, {"Test name": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_16", "GPU": "1xStandard_ND96asr_v4 x 2", "# of req.": 200, "Tput (req/s)": 3.697917701235066, "Output Tput (tok/s)": 791.8905861309833, "Total Tput (tok/s)": 1580.6379422159166, "Mean TTFT (ms)": 226.7617405450983, "Median TTFT (ms)": 215.64252300049702, "P99 TTFT (ms)": 479.7536375109121, "Mean TPOT (ms)": 87.57618569686481, "Median TPOT (ms)": 81.37176336238505, "P99 TPOT (ms)": 153.84377854204757, "Mean ITL (ms)": 72.72835843662789, "Median ITL (ms)": 64.36212599874125, "P99 ITL (ms)": 233.4853582404321}, {"Test name": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_inf", "GPU": "1xStandard_ND96asr_v4 x 2", "# of req.": 200, "Tput (req/s)": 4.132880392447687, "Output Tput (tok/s)": 880.9027912482622, "Total Tput (tok/s)": 1762.4255145553916, "Mean TTFT (ms)": 2683.6246253499667, "Median TTFT (ms)": 2771.161826000025, "P99 TTFT (ms)": 4838.2172842909495, "Mean TPOT (ms)": 114.09317725452917, "Median TPOT (ms)": 79.11987648951599, "P99 TPOT (ms)": 400.5428015899088, "Mean ITL (ms)": 71.80776212203955, "Median ITL (ms)": 64.29544500133488, "P99 ITL (ms)": 399.1085860384919}]

Latency tests

  • Input length: 32 tokens.
  • Output length: 128 tokens.
  • Batch size: fixed (8).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: end-to-end latency (mean, median, p99).
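
The latency tables in this gist are empty, but a run of this shape would typically use vLLM's benchmark_latency.py. The sketch below is only an assumption based on the upstream benchmark suite (flag names can differ between vLLM versions, and the tensor-parallel setting simply mirrors the tp4 used for serving here); it is not a command that was run for this gist:

python3 benchmark_latency.py \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --input-len 32 \
  --output-len 128 \
  --batch-size 8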

Throughput tests

  • Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
  • Output length: the corresponding output length of these 200 prompts.
  • Batch size: dynamically determined by vLLM to achieve maximum throughput.
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • Evaluation metrics: throughput.
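
The throughput tables in this gist are likewise empty. For reference, a throughput run of this shape would typically use vLLM's benchmark_throughput.py; the sketch below assumes the same ShareGPT dataset path and the --dataset-name/--dataset-path convention seen in the serving command recorded later in this gist, and was not run here:

python3 benchmark_throughput.py \
  --backend vllm \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /root/sharegpt.json \
  --num-prompts 200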

Serving tests

  • Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
  • Output length: the corresponding output length of these 200 prompts.
  • Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  • Average QPS (query per second): 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
  • Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
  • We also added a speculative decoding test for llama-3 70B, under QPS 2.
  • Evaluation metrics: throughput, TTFT (time to first token; mean, median and p99), TPOT (time per output token; mean, median and p99), and ITL (inter-token latency; mean, median and p99). The exact benchmark_serving.py invocation behind each row is shown after the results table below.

| Test name | GPU | # of req. | Tput (req/s) | Output Tput (tok/s) | Total Tput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_01 | 1xStandard_ND96asr_v4 x 2 | 200 | 0.928406 | 198.242 | 396.267 | 110.373 | 96.9817 | 230.301 | 43.7218 | 43.5453 | 50.5137 | 43.6313 | 42.2756 | 87.9981 |
| serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_04 | 1xStandard_ND96asr_v4 x 2 | 200 | 2.52147 | 539.053 | 1076.87 | 139.838 | 125.156 | 332.965 | 61.6271 | 63.497 | 83.148 | 60.9031 | 57.8237 | 173.346 |
| serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_16 | 1xStandard_ND96asr_v4 x 2 | 200 | 3.69792 | 791.891 | 1580.64 | 226.762 | 215.643 | 479.754 | 87.5762 | 81.3718 | 153.844 | 72.7284 | 64.3621 | 233.485 |
| serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_inf | 1xStandard_ND96asr_v4 x 2 | 200 | 4.13288 | 880.903 | 1762.43 | 2683.62 | 2771.16 | 4838.22 | 114.093 | 79.1199 | 400.543 | 71.8078 | 64.2954 | 399.109 |
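
Each row above is a separate run of vLLM's benchmark_serving.py against the deployed model. The QPS 1 invocation below is taken verbatim from the client_command recorded in the raw result file (see the excerpt near the end of this gist); for the other rows, presumably only --request-rate and the result filename change:

python3 benchmark_serving.py \
  --save-result \
  --base-url http://llama-3-3-70b-instruct-leader.default:8000 \
  --result-dir /root/vllm/.buildkite/results/ \
  --result-filename serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_01.json \
  --request-rate 01 \
  --model=meta-llama/Llama-3.3-70B-Instruct \
  --backend=vllm \
  --dataset-name=sharegpt \
  --dataset-path=/root/sharegpt.json \
  --num-prompts=200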

JSON version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format. You can load the benchmarking tables into pandas dataframes as follows:

import json
import pandas as pd

# Paste the JSON string given further down in place of the placeholder below.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key ("latency", "throughput", "serving") becomes its own dataframe.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])

The JSON string for all benchmarking tables:

{"latency": {}, "throughput": {}, "serving": {"Test name": {"0": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_inf", "1": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_01", "2": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_04", "3": "serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_16"}, "GPU": {"0": "Standard_ND96asr_v4 x 2", "1": "Standard_ND96asr_v4 x 2", "2": "Standard_ND96asr_v4 x 2", "3": "Standard_ND96asr_v4 x 2"}, "# of req.": {"0": 200, "1": 200, "2": 200, "3": 200}, "Tput (req/s)": {"0": 4.132880392447687, "1": 0.9284057358744006, "2": 2.521471685463534, "3": 3.697917701235066}, "Output Tput (tok/s)": {"0": 880.9027912482622, "1": 198.24247678126076, "2": 539.0528242768216, "3": 791.8905861309833}, "Total Tput (tok/s)": {"0": 1762.4255145553916, "1": 396.266778214591, "2": 1076.8701274277662, "3": 1580.6379422159166}, "Mean TTFT (ms)": {"0": 2683.6246253499667, "1": 110.37337160010793, "2": 139.8380736899344, "3": 226.7617405450983}, "Median TTFT (ms)": {"0": 2771.161826000025, "1": 96.9816950000677, "2": 125.15622350110789, "3": 215.64252300049702}, "P99 TTFT (ms)": {"0": 4838.2172842909495, "1": 230.3005734290491, "2": 332.96458055017825, "3": 479.7536375109121}, "Mean TPOT (ms)": {"0": 114.09317725452917, "1": 43.72182021034344, "2": 61.62705314762229, "3": 87.57618569686481}, "Median TPOT (ms)": {"0": 79.11987648951599, "1": 43.54532462942404, "2": 63.49695762410795, "3": 81.37176336238505}, "P99 TPOT (ms)": {"0": 400.5428015899088, "1": 50.513716590712384, "2": 83.14804765725845, "3": 153.84377854204757}, "Mean ITL (ms)": {"0": 71.80776212203955, "1": 43.631314270832306, "2": 60.90314970594733, "3": 72.72835843662789}, "Median ITL (ms)": {"0": 64.29544500133488, "1": 42.27557599915599, "2": 57.82372999965446, "3": 64.36212599874125}, "P99 ITL (ms)": {"0": 399.1085860384919, "1": 87.99811164881247, "2": 173.3455703589425, "3": 233.4853582404321}}}

You can also check the raw experiment data in the Artifact tab of the Buildkite page.

# Instructions are here: https://github.com/surajssd/llm-k8s/blob/9271454bc5a008a437c7b52c33409b18d6cb2220/configs/llama-3-3-70b-instruct/two-nodes-eight-gpus
git clone https://github.com/surajssd/llm-k8s
cd llm-k8s
git checkout 9271454bc5a008a437c7b52c33409b18d6cb2220
source .env
export VM_SIZE="Standard_ND96asr_v4"
export GPU_NODE_COUNT=2
export AZURE_REGION=southcentralus
./scripts/deploy-aks.sh deploy_aks
./scripts/deploy-aks.sh download_aks_credentials
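# Optional sanity check (not part of the original steps): confirm kubectl now
# points at the new cluster.
kubectl get nodes -o wide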
./scripts/deploy-aks.sh install_kube_prometheus
./scripts/deploy-aks.sh install_lws_controller
./scripts/deploy-aks.sh add_nodepool
./scripts/deploy-aks.sh install_network_operator
./scripts/deploy-aks.sh install_gpu_operator
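# Optional check (an addition to the original steps): once the GPU operator is
# running, each Standard_ND96asr_v4 node should advertise 8 nvidia.com/gpu.
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'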
export HF_TOKEN=""
kubectl create secret generic hf-token-secret --from-literal token=${HF_TOKEN}
kubectl apply -f configs/llama-3-3-70b-instruct/two-nodes-eight-gpus/k8s/
./configs/llama-3-3-70b-instruct/two-nodes-eight-gpus/fix-svc.sh
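# Optional: watch the leader and worker pods come up; pulling the 70B weights
# can take a while, so wait until everything is Running before port-forwarding.
kubectl get pods -w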
# Test that the model is deployed: port-forward the leader service (this blocks,
# so run it in a separate terminal) and send a chat completion request.
kubectl port-forward svc/llama-3-3-70b-instruct-leader 8000
# If the request below returns a completion, the model is being served.
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Explain the origin of Llama the animal?"
      }
    ]
  }' | jq
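# Optional lighter-weight check (not in the original steps): list the models
# exposed by vLLM's OpenAI-compatible API.
curl http://localhost:8000/v1/models | jq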
# Benchmark
# Steps: https://github.com/surajssd/llm-k8s/blob/9271454bc5a008a437c7b52c33409b18d6cb2220/benchmark/vllm_upstream
kubectl create ns vllm-benchmark
kubectl -n vllm-benchmark create configmap benchmark-runner \
  --from-literal=TEST_SERVER_URL="http://llama-3-3-70b-instruct-leader.default:8000" \
  --from-literal=MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct" \
  --from-literal=TENSOR_PARALLEL_SIZE=4 \
  --from-literal=PIPELINE_PARALLEL_SIZE="${GPU_NODE_COUNT}" \
  --from-literal=GPU_VM_SKU="${VM_SIZE}" \
  --dry-run=client -o yaml | kubectl apply -f -
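# Optional: inspect the rendered ConfigMap before launching the benchmark.
kubectl -n vllm-benchmark get configmap benchmark-runner -o yaml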
kubectl -n vllm-benchmark create secret generic hf-token-secret --from-literal token=${HF_TOKEN}
kubectl apply -f benchmark/vllm_upstream/k8s/
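# Optional: wait for the benchmark-runner pod to become Ready before exec'ing
# into it (app=benchmark-runner matches the selector used below).
kubectl -n vllm-benchmark wait --for=condition=Ready pod -l app=benchmark-runner --timeout=15m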
POD_NAME=$(kubectl -n vllm-benchmark \
  get pods \
  -l app=benchmark-runner \
  --field-selector=status.phase=Running \
  -o jsonpath='{.items[].metadata.name}')
kubectl -n vllm-benchmark \
  exec -it $POD_NAME \
  -- bash /root/scripts/run_vllm_upstream_benchmark.sh
RESULTS_FILE=$(kubectl -n vllm-benchmark \
  exec -it $POD_NAME \
  -- bash -c "ls /root/results*.tar.gz" | tr -d '\r')
kubectl -n vllm-benchmark \
  cp "${POD_NAME}:${RESULTS_FILE}" "./$(basename ${RESULTS_FILE})"
{
  "client_command": "python3 benchmark_serving.py --save-result --base-url http://llama-3-3-70b-instruct-leader.default:8000 --result-dir /root/vllm/.buildkite/results/ --result-filename serving_meta-llama-Llama-3.3-70B-Instruct_tp4_pp2_sharegpt_qps_01.json --request-rate 01 --model=meta-llama/Llama-3.3-70B-Instruct --backend=vllm --dataset-name=sharegpt --dataset-path=/root/sharegpt.json --num-prompts=200",
  "gpu_type": "Standard_ND96asr_v4 x 2"
}