Repo at llm-d/llm-d-inference-sim
# podman works the same way: replace "docker" with "podman".
docker run --rm --net host ghcr.io/llm-d/llm-d-inference-sim \
--port 8000 \
--model "Qwen/Qwen2.5-1.5B-Instruct" \
--lora "tweet-summary-0,tweet-summary-1"
Example Output
I0701 20:17:13.277787 1 cmd.go:36] "Start vllm simulator"
I0701 20:17:13.278045 1 simulator.go:136] "Server starting" port=8000
I0701 20:18:52.785435 1 simulator.go:199] "chat completion request received"
I0701 20:18:52.785584 1 simulator.go:316] "Create reference counter" model="tweet-summary-0"
I0701 20:18:52.785600 1 simulator.go:322] "Update LoRA reference counter" model="tweet-summary-0" old value=0 new value=1
I0701 20:18:52.785697 1 simulator.go:378] "Remove LoRA from set of running loras" model="tweet-summary-0"
I0701 20:20:04.106332 1 simulator.go:199] "chat completion request received"
I0701 20:20:04.106409 1 simulator.go:316] "Create reference counter" model="tweet-summary-0"
I0701 20:20:04.106426 1 simulator.go:322] "Update LoRA reference counter" model="tweet-summary-0" old value=0 new value=1
I0701 20:20:04.106469 1 simulator.go:378] "Remove LoRA from set of running loras" model="tweet-summary-0"
I0701 20:22:46.464957 1 simulator.go:199] "chat completion request received"
I0701 20:22:46.465162 1 streaming.go:42] "Going to send text" resp body="Today it is partially cloudy and raining." tokens num=7
I0701 20:24:22.093467 1 simulator.go:199] "chat completion request received"
The container starts an HTTP server on localhost:8000 that serves OpenAI-style /v1/*
endpoints and advertises one base model plus two LoRA adapters.
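When the container is started from a script, it can help to wait until the server answers before sending requests. The following is a minimal sketch, not part of the simulator: `wait_for_sim` is a hypothetical helper name, and the retry count and delay defaults are arbitrary assumptions.

```shell
# Poll /v1/models until the simulator answers, or give up after N attempts.
# wait_for_sim is a hypothetical helper, not something the simulator ships.
# Arguments: base URL, number of attempts, delay in seconds between attempts.
wait_for_sim() {
  base_url="${1:-http://localhost:8000}"
  attempts="${2:-30}"
  delay="${3:-1}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s keeps the loop quiet.
    if curl -fs -o /dev/null --max-time 2 "$base_url/v1/models"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Usage: `wait_for_sim http://localhost:8000 30 1 && echo "simulator is up"`.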
curl http://localhost:8000/v1/models | jq .
Sample response
{
"object": "list",
"data": [
{
"id": "Qwen/Qwen2.5-1.5B-Instruct",
"object": "model",
"created": 1751401669,
"owned_by": "vllm",
"root": "Qwen/Qwen2.5-1.5B-Instruct",
"parent": null
},
{
"id": "tweet-summary-0",
"object": "model",
"created": 1751401669,
"owned_by": "vllm",
"root": "tweet-summary-0",
"parent": "Qwen/Qwen2.5-1.5B-Instruct"
},
{
"id": "tweet-summary-1",
"object": "model",
"created": 1751401669,
"owned_by": "vllm",
"root": "tweet-summary-1",
"parent": "Qwen/Qwen2.5-1.5B-Instruct"
}
]
}
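In the response above, the LoRA adapters are the entries whose `parent` is not null. A small jq filter can pull out just their ids. This is a sketch under that assumption; `list_adapters` is a hypothetical helper name.

```shell
# Print the ids of LoRA adapters (entries with a non-null parent) from a
# /v1/models response read on stdin. list_adapters is a hypothetical helper.
list_adapters() {
  jq -r '.data[] | select(.parent != null) | .id'
}

# Example (needs the simulator running):
#   curl -s http://localhost:8000/v1/models | list_adapters
```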
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Give me three fun facts about red pandas." }
],
"temperature": 0.7,
"max_tokens": 256
}' | jq .
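Expected: a JSON payload with the assistant’s reply in choices[0].message.content
plus token-usage stats in usage. The simulator does not run a real model, so in
this setup the reply text is canned and unrelated to the prompt.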
Sample response
{
"id": "chatcmpl-b8020775-d003-487c-b51f-6943facde5a7",
"created": 1751401855,
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"usage": {
"prompt_tokens": 13,
"completion_tokens": 7,
"total_tokens": 20
},
"object": "chat.completion",
"choices": [
{
"index": 0,
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": "Today it is partially cloudy and raining."
}
}
]
}
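To read only the reply and the token counts from a response shaped like the sample above, a jq filter along these lines should work. `summarize_completion` is a hypothetical helper name, and the summary line format is just one possible choice.

```shell
# Read a chat completion response on stdin and print the reply text followed
# by a one-line token summary. summarize_completion is a hypothetical helper.
summarize_completion() {
  jq -r '.choices[0].message.content,
         "tokens: \(.usage.prompt_tokens) prompt + \(.usage.completion_tokens) completion = \(.usage.total_tokens) total"'
}

# Example (needs the simulator running):
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Hi"}]}' \
#     | summarize_completion
```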
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "tweet-summary-0",
"messages": [
{ "role": "system", "content": "You summarize tweets in one sentence." },
{ "role": "user", "content": "Pretty sure AI won’t end humanity, it’ll just auto-schedule our existential crisis" }
],
"temperature": 0.3,
"max_tokens": 64
}' | jq .
Swap "tweet-summary-0" for "tweet-summary-1" to compare adapters.
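Rather than editing the request by hand, the same prompt can be sent to each adapter in a loop. The sketch below assumes the server from the run command above is listening on localhost:8000; `chat_payload` and `compare_adapters` are hypothetical helper names, and the request body is built with jq so that quotes in the prompt are escaped correctly.

```shell
# Build a chat request body for the given model id and user prompt.
# jq --arg handles JSON escaping. chat_payload is a hypothetical helper.
chat_payload() {
  jq -cn --arg model "$1" --arg prompt "$2" \
    '{model: $model,
      messages: [{role: "user", content: $prompt}],
      temperature: 0.3,
      max_tokens: 64}'
}

# Send one prompt to every model id given after it and print each reply.
# compare_adapters is a hypothetical helper; the URL is an assumption.
compare_adapters() {
  prompt="$1"
  shift
  for model in "$@"; do
    printf '%s: ' "$model"
    chat_payload "$model" "$prompt" |
      curl -s http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" -d @- |
      jq -r '.choices[0].message.content'
  done
}
```

Usage: `compare_adapters "Summarize: AI will auto-schedule our existential crisis" tweet-summary-0 tweet-summary-1`.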
curl -N http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"stream": true,
"messages": [
{ "role": "user", "content": "Attention Mechanism in LLMs 100 words." }
]
}'
Sample response
data: {"id":"chatcmpl-a783655b-3853-46ed-939f-69563a685627","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant"}}]}
data: {"id":"chatcmpl-2ea29f55-9865-4ad6-95d2-5d115b55e47a","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":"Today"}}]}
data: {"id":"chatcmpl-c853009b-fa1e-4ddc-a760-97e25b3c251e","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":" it"}}]}
data: {"id":"chatcmpl-3c5afb84-54ad-4e45-ae80-14f9e97f4f3f","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":" is"}}]}
data: {"id":"chatcmpl-e7bd1fde-28f6-4264-8b75-c58a66c05e12","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":" partially"}}]}
data: {"id":"chatcmpl-7c887673-98b0-4c7c-86c6-cc1a0c4852cb","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":" cloudy"}}]}
data: {"id":"chatcmpl-ff237ed7-6086-4723-8d45-f317b149483c","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":" and"}}]}
data: {"id":"chatcmpl-d364d700-eacb-4e56-bbed-238ab67c0bca","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":null,"delta":{"content":" raining."}}]}
data: {"id":"chatcmpl-d3bf810d-3fd1-435b-8cb1-4f3591dd31ad","created":1751401366,"model":"Qwen/Qwen2.5-1.5B-Instruct","usage":null,"object":"chat.completion.chunk","choices":[{"index":0,"finish_reason":"stop","delta":{}}]}
data: [DONE]
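Each `data:` line in the stream carries one piece of the reply in `choices[0].delta.content`, and `[DONE]` marks the end. To see the reassembled text instead of raw chunks, the stream can be piped through a filter like this sketch. `join_stream` is a hypothetical helper name, and the filter assumes every payload other than `[DONE]` is a single-line JSON object, as in the sample above.

```shell
# Read an SSE stream on stdin and print the concatenated delta contents.
# join_stream is a hypothetical helper name.
join_stream() {
  # 1. keep only the payload of "data: " lines
  # 2. drop the [DONE] end-of-stream sentinel
  # 3. print each delta's content with no separator; skip chunks without one
  sed -n 's/^data: //p' |
    grep -v '^\[DONE\]$' |
    jq -j '.choices[0].delta.content // empty'
  # finish the reassembled text with a newline
  echo
}

# Example (needs the simulator running):
#   curl -sN http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "stream": true,
#          "messages": [{"role": "user", "content": "Hello"}]}' | join_stream
```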
The examples above cover:
- Base model experimentation
- LoRA-specific prompts & metrics
- Streaming token demos