
vLLM Inference Simulator

Repo at llm-d/llm-d-inference-sim

1. Start the sim container

Run with either podman or docker (the flags are identical):

docker run --rm --net host ghcr.io/llm-d/llm-d-inference-sim \
  --port 8000 \
  --model "Qwen/Qwen2.5-1.5B-Instruct" \
  --lora "tweet-summary-0,tweet-summary-1"
Example Output
I0701 20:17:13.277787       1 cmd.go:36] "Start vllm simulator"
I0701 20:17:13.278045       1 simulator.go:136] "Server starting" port=8000
I0701 20:18:52.785435       1 simulator.go:199] "chat completion request received"
I0701 20:18:52.785584       1 simulator.go:316] "Create reference counter" model="tweet-summary-0"
I0701 20:18:52.785600       1 simulator.go:322] "Update LoRA reference counter" model="tweet-summary-0" old value=0 new value=1
I0701 20:18:52.785697       1 simulator.go:378] "Remove LoRA from set of running loras" model="tweet-summary-0"
I0701 20:20:04.106332       1 simulator.go:199] "chat completion request received"
I0701 20:20:04.106409       1 simulator.go:316] "Create reference counter" model="tweet-summary-0"
I0701 20:20:04.106426       1 simulator.go:322] "Update LoRA reference counter" model="tweet-summary-0" old value=0 new value=1
I0701 20:20:04.106469       1 simulator.go:378] "Remove LoRA from set of running loras" model="tweet-summary-0"
I0701 20:22:46.464957       1 simulator.go:199] "chat completion request received"
I0701 20:22:46.465162       1 streaming.go:42] "Going to send text" resp body="Today it is partially cloudy and raining." tokens num=7
I0701 20:24:22.093467       1 simulator.go:199] "chat completion request received"

The image boots an HTTP server on localhost:8000 that speaks OpenAI-style /v1/* endpoints and advertises one base model plus two LoRA adapters.
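
The container serves immediately, but in scripts it can help to wait until the endpoint actually answers before sending requests. A minimal readiness loop that polls the /v1/models endpoint used in the next step (the 30-second budget is arbitrary):

# Wait up to ~30s for the simulator to start answering
for i in $(seq 1 30); do
  curl -sf http://localhost:8000/v1/models > /dev/null && break
  sleep 1
done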


2. Verify what models are live

curl http://localhost:8000/v1/models | jq .
Sample response
{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen2.5-1.5B-Instruct",
      "object": "model",
      "created": 1751401669,
      "owned_by": "vllm",
      "root": "Qwen/Qwen2.5-1.5B-Instruct",
      "parent": null
    },
    {
      "id": "tweet-summary-0",
      "object": "model",
      "created": 1751401669,
      "owned_by": "vllm",
      "root": "tweet-summary-0",
      "parent": "Qwen/Qwen2.5-1.5B-Instruct"
    },
    {
      "id": "tweet-summary-1",
      "object": "model",
      "created": 1751401669,
      "owned_by": "vllm",
      "root": "tweet-summary-1",
      "parent": "Qwen/Qwen2.5-1.5B-Instruct"
    }
  ]
}
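
To grab just the model IDs (useful for scripting), filter the list with jq; this relies only on the fields shown in the response above:

curl -s http://localhost:8000/v1/models | jq -r '.data[].id'

which should print the base model followed by both adapters.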

3. Chat with the base model

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
          { "role": "system", "content": "You are a helpful assistant." },
          { "role": "user",   "content": "Give me three fun facts about red pandas." }
        ],
        "temperature": 0.7,
        "max_tokens": 256
      }' | jq .

Expected: a JSON payload with the assistant’s reply in choices[0].message.content plus token-usage stats.

Sample response
{
  "id": "chatcmpl-b8020775-d003-487c-b51f-6943facde5a7",
  "created": 1751401855,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "usage": {
    "prompt_tokens": 13,
    "completion_tokens": 7,
    "total_tokens": 20
  },
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Today it is partially cloudy and raining."
      }
    }
  ]
}
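
To extract only the reply text rather than the full payload, pipe the same request through jq (the path mirrors the response shape above):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
          { "role": "user", "content": "Give me three fun facts about red pandas." }
        ],
        "max_tokens": 256
      }' | jq -r '.choices[0].message.content'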

4. Use a LoRA adapter for tweet summarization

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "tweet-summary-0",
        "messages": [
          { "role": "system", "content": "You summarize tweets in one sentence." },
          { "role": "user",   "content": "Pretty sure AI won’t end humanity, it’ll just auto-schedule our existential crisis" }
        ],
        "temperature": 0.3,
        "max_tokens": 64
      }' | jq .

Swap "tweet-summary-0" for "tweet-summary-1" to compare adapters.
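
To run the same prompt through both adapters back to back, a small shell loop over the adapter names from step 2 works (the loop and the simplified prompt are just a convenience, not part of the simulator):

for m in tweet-summary-0 tweet-summary-1; do
  echo "== $m =="
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{
          \"model\": \"$m\",
          \"messages\": [
            { \"role\": \"user\", \"content\": \"Summarize: AI will just auto-schedule our existential crisis.\" }
          ],
          \"max_tokens\": 64
        }" | jq -r '.choices[0].message.content'
done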


5. Watch tokens stream live

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "stream": true,
        "messages": [
          { "role": "user", "content": "Attention Mechanism in LLMs 100 words." }
        ]
      }'
Sample response
data: {"id":"chatcmpl-a783655b-3853-46ed-939f-69563a685627","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant"}}]}

data: {"id":"chatcmpl-2ea29f55-9865-4ad6-95d2-5d115b55e47a","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":"Today"}}]}

data: {"id":"chatcmpl-c853009b-fa1e-4ddc-a760-97e25b3c251e","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":" it"}}]}

data: {"id":"chatcmpl-3c5afb84-54ad-4e45-ae80-14f9e97f4f3f","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":" is"}}]}

data: {"id":"chatcmpl-e7bd1fde-28f6-4264-8b75-c58a66c05e12","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":" partially"}}]}

data: {"id":"chatcmpl-7c887673-98b0-4c7c-86c6-cc1a0c4852cb","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":" cloudy"}}]}

data: {"id":"chatcmpl-ff237ed7-6086-4723-8d45-f317b149483c","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":" and"}}]}

data: {"id":"chatcmpl-d364d700-eacb-4e56-bbed-238ab67c0bca","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":" raining."}}]}

data: {"id":"chatcmpl-d3bf810d-3fd1-435b-8cb1-4f3591dd31ad","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":"stop","delta":{}}]}

data: [DONE]
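
To watch only the generated text instead of raw SSE frames, you can strip the data: prefix and pull each delta's content out with jq. A sketch, assuming GNU sed/grep for line buffering; the field paths match the chunks shown above, and the [DONE] sentinel and content-less chunks are filtered out:

curl -sN http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "stream": true,
        "messages": [
          { "role": "user", "content": "Attention Mechanism in LLMs 100 words." }
        ]
      }' \
  | sed -u 's/^data: //' \
  | grep --line-buffered -v '^\[DONE\]' \
  | jq -j 'select(.choices[0].delta.content != null) | .choices[0].delta.content'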

Summary

The examples covered:

  • Base-model chat completions
  • LoRA-specific prompts
  • Streaming token output