$ k logs llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8 -c vllm
INFO 05-03 16:00:29 [__init__.py:239] Automatically detected platform cuda.
INFO 05-03 16:00:32 [api_server.py:1042] vLLM API server version 0.1.dev1+g9b70e2b
INFO 05-03 16:00:32 [api_server.py:1043] args: Namespace(host=None, port=8200, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.2-3B-Instruct', task='auto', tokenizer=None, tokenizer_mode='auto', trust_remote_code=False, dtype='auto', seed=None, hf_config_path=None, allowed_local_media_path='', revision=None, code_revision=None, rope_scaling={}, rope_theta=None, tokenizer_revision=None, max_model_len=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, max_logprobs=20, disable_sliding_window=False, disable_cascade_attn=False, skip_tokenizer_init=False, served_model_name=None, disable_async_output_proc=False, config_format='auto', hf_token=None, hf_overrides={}, override_neuron_config={}, override_pooler_config=None, logits_processor_pattern=None, generation_config='auto', override_generation_config={}, enable_sleep_mode=False, model_impl='auto', load_format='auto', download_dir=None, model_loader_extra_config={}, ignore_patterns=None, use_tqdm_on_load=True, qlora_adapter_name_or_path=None, guided_decoding_backend='auto', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, enable_reasoning=None, reasoning_parser='', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, worker_cls='auto', worker_extension_cls='', block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, cuda_graph_sizes=[512], long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='LMCacheConnectorV1', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_consumer', kv_rank=None, kv_parallel_size=1, 
kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={'discard_partial_chunks': False, 'lmcache_rpc_port': 'consumer1'}), kv_events_config=None, additional_config=None, use_v2_block_manager=True, disable_log_stats=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 05-03 16:00:39 [config.py:751] This model supports multiple tasks: {'classify', 'reward', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 05-03 16:00:39 [config.py:2047] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-03 16:00:45 [__init__.py:239] Automatically detected platform cuda.
INFO 05-03 16:00:48 [core.py:59] Initializing a V1 LLM engine (v0.1.dev1+g9b70e2b) with config: model='meta-llama/Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=meta-llama/Llama-3.2-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-03 16:00:48 [utils.py:2550] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x70912878d520>
INFO 05-03 16:00:49 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-03 16:00:49 [factory.py:64] Creating v1 connector with name: LMCacheConnectorV1
WARNING 05-03 16:00:49 [base.py:58] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
[2025-05-03 16:00:49,437] LMCache INFO: Loading LMCache config file /vllm-workspace/lmcache-decoder-config.yaml (utils.py:42:lmcache.integration.vllm.utils)
[2025-05-03 16:00:49,438] LMCache INFO: LMCache Configuration: {'chunk_size': 256, 'local_cpu': False, 'max_local_cpu_size': '0 GB', 'local_disk': None, 'max_local_disk_size': '0 GB', 'remote_url': None, 'remote_serde': None, 'save_decode_cache': False, 'enable_blending': False, 'blend_recompute_ratio': 0.15, 'blend_min_tokens': 256, 'enable_p2p': False, 'lookup_url': None, 'distributed_url': None, 'error_handling': False, 'enable_controller': False, 'lmcache_instance_id': 'lmcache_default_instance', 'enable_nixl': True, 'nixl_role': 'receiver', 'nixl_peer_host': '0.0.0.0', 'nixl_peer_port': 55555, 'nixl_buffer_size': 524288, 'nixl_buffer_device': 'cuda', 'nixl_enable_gc': True} (config.py:452:lmcache.experimental.config)
[2025-05-03 16:00:49,439] LMCache INFO: Creating LMCacheEngine instance vllm-instance (cache_engine.py:467:lmcache.experimental.cache_engine)
[2025-05-03 16:00:49,440] LMCache INFO: Creating LMCacheEngine with config: LMCacheEngineConfig(chunk_size=256, local_cpu=False, max_local_cpu_size=0, local_disk=None, max_local_disk_size=0, remote_url=None, remote_serde=None, save_decode_cache=False, enable_blending=False, blend_recompute_ratio=0.15, blend_min_tokens=256, blend_special_str=' # # ', enable_p2p=False, lookup_url=None, distributed_url=None, error_handling=False, enable_controller=False, lmcache_instance_id='lmcache_default_instance', controller_url=None, lmcache_worker_url=None, enable_nixl=True, nixl_role='receiver', nixl_peer_host='0.0.0.0', nixl_peer_port=55555, nixl_buffer_size=524288, nixl_buffer_device='cuda', nixl_enable_gc=True) (cache_engine.py:73:lmcache.experimental.cache_engine)
Failed to load plugin from /usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_UCX_MO.so: libplugin_UCX.so: cannot open shared object file: No such file or directory
Failed to load plugin 'UCX_MO' from any directory
Loaded plugin GDS
Loaded plugin UCX
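Note: the decode engine loads /vllm-workspace/lmcache-decoder-config.yaml (mounted from the config-decoder ConfigMap). To double-check what it actually picked up, the file can be dumped from the running container; its values should match the "LMCache Configuration" line above (enable_nixl: true, nixl_role: receiver, nixl_peer_port: 55555, etc.). Pod name and llm-d namespace are taken from this gist.

$ kubectl exec -n llm-d llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8 -c vllm -- \
    cat /vllm-workspace/lmcache-decoder-config.yaml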
##########################################################################
root 1 0 4 16:00 ? 00:00:15 python3 -m vllm.entrypoints.openai.api_server --port 8200 --kv-transfer-config {"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer1"}} --model meta-llama/Llama-3.2-3B-Instruct
root 119 1 0 16:00 ? 00:00:00 /workspace/vllm/.vllm/bin/python3 -c from multiprocessing.resource_tracker import main;main(36)
root 120 1 99 16:00 ? 00:05:26 /workspace/vllm/.vllm/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=37, pipe_handle=39) --multiprocessing-fork
root 204 0 0 16:01 pts/0 00:00:00 /bin/bash
root 215 204 0 16:06 pts/0 00:00:00 ps -eaf
# ss -tlnp | grep :8000
LISTEN 0 4096 *:8000 *:*
##########################################################################
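With :8000 listening (the routing-proxy sidecar) and vLLM itself bound to :8200, a quick sanity check is to hit the OpenAI-compatible /v1/models endpoint through the sidecar from inside the pod. This assumes curl is present in the image and that the sidecar simply forwards to vLLM on :8200:

$ kubectl exec -n llm-d llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8 -c vllm -- \
    curl -s http://localhost:8000/v1/models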
$ kubectl describe pod llama-3.2-3b-instruct-prefill-7475648cc4-p4ww7 -n llm-d | \
sed -n '/Containers:/,/Volumes:/p'
Containers:
  vllm:
    Container ID:   docker://f2a7ba72ea8a82d0474aa74b21140488a90d7c29182e527c9d965578c01ee558
    Image:          quay.io/llm-d/llm-d-dev:0.0.6
    Image ID:       docker-pullable://quay.io/llm-d/llm-d-dev@sha256:281e7ee67c8993d3f3f69ac27030fca3735be083056dd877b71861153d8da1e4
    Port:           8000/TCP
    Host Port:      0/TCP
    Args:
      --port
      8000
      --kv-transfer-config
      {"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer1"}}
      --model
      meta-llama/Llama-3.2-3B-Instruct
    State:          Running
      Started:      Sat, 03 May 2025 16:00:24 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      cpu:             16
      memory:          16Gi
      nvidia.com/gpu:  1
    Environment:
      CUDA_VISIBLE_DEVICES:            0
      UCX_TLS:                         cuda_ipc,cuda_copy,tcp
      LMCACHE_CONFIG_FILE:             /vllm-workspace/lmcache-prefiller-config.yaml
      LMCACHE_USE_EXPERIMENTAL:        True
      VLLM_ENABLE_V1_MULTIPROCESSING:  1
      VLLM_WORKER_MULTIPROC_METHOD:    spawn
      HF_HUB_CACHE:                    /vllm-workspace/models
      HF_TOKEN:                        <set to the key 'HF_TOKEN' in secret 'llm-d-hf-token'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zm684 (ro)
      /vllm-workspace from config-prefiller (rw)
      /vllm-workspace/models from model-cache (rw)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
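The prefill pod is the mirror image of the decode pod (kv_role=kv_producer, lmcache-prefiller-config.yaml). Its LMCache config can be inspected the same way; the path comes from the LMCACHE_CONFIG_FILE env var above:

$ kubectl exec -n llm-d llama-3.2-3b-instruct-prefill-7475648cc4-p4ww7 -c vllm -- \
    cat /vllm-workspace/lmcache-prefiller-config.yaml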
$ kubectl describe pod llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8
Name:             llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8
Namespace:        llm-d
Priority:         0
Service Account:  default
Node:             minikube-m02/192.168.49.3
Start Time:       Sat, 03 May 2025 16:00:20 +0000
Labels:           llm-d.ai/inferenceServing=true
                  llm-d.ai/model=llama-3.2-3b-instruct
                  llm-d.ai/role=decode
                  pod-template-hash=6dcb767b75
Annotations:      <none>
Status:           Running
IP:               10.244.1.12
IPs:
  IP:  10.244.1.12
Controlled By:  ReplicaSet/llama-3.2-3b-instruct-decode-6dcb767b75
Init Containers:
  routing-proxy:
    Container ID:   docker://bdda09ea4a2dc7e26624573754475687b208263d997a73596c726394679048ca
    Image:          quay.io/llm-d/llm-d-routing-sidecar-dev:0.0.6
    Image ID:       docker-pullable://quay.io/llm-d/llm-d-routing-sidecar-dev@sha256:4243179e3b0d33fbf9168c9b9296b1893776e91c584b2ec0c0a44fcaad5928d5
    Port:           8000/TCP
    Host Port:      0/TCP
    Args:
      --port=8000
      --vllm-port=8200
    State:          Running
      Started:      Sat, 03 May 2025 16:00:21 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gv579 (ro)
Containers:
  vllm:
    Container ID:   docker://9c316ac1dd21aca19c66e2f6987ed26c2e74f1b45fc30063d7c8875dac347e50
    Image:          quay.io/llm-d/llm-d-dev:0.0.6
    Image ID:       docker-pullable://quay.io/llm-d/llm-d-dev@sha256:281e7ee67c8993d3f3f69ac27030fca3735be083056dd877b71861153d8da1e4
    Port:           55555/TCP
    Host Port:      0/TCP
    Args:
      --port
      8200
      --kv-transfer-config
      {"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer1"}}
      --model
      meta-llama/Llama-3.2-3B-Instruct
    State:          Running
      Started:      Sat, 03 May 2025 16:00:21 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      cpu:             16
      memory:          16Gi
      nvidia.com/gpu:  1
    Environment:
      CUDA_VISIBLE_DEVICES:            0
      UCX_TLS:                         cuda_ipc,cuda_copy,tcp
      LMCACHE_CONFIG_FILE:             /vllm-workspace/lmcache-decoder-config.yaml
      LMCACHE_USE_EXPERIMENTAL:        True
      VLLM_ENABLE_V1_MULTIPROCESSING:  1
      VLLM_WORKER_MULTIPROC_METHOD:    spawn
      HF_HUB_CACHE:                    /vllm-workspace/models
      HF_TOKEN:                        <set to the key 'HF_TOKEN' in secret 'llm-d-hf-token'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gv579 (ro)
      /vllm-workspace from config-decoder (rw)
      /vllm-workspace/models from model-cache (rw)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  config-decoder:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      llm-d-modelservice-config-decoder
    Optional:  false
  model-cache:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  1Gi
  model-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  llama-3.2-3b-instruct-pvc
    ReadOnly:   true
  kube-api-access-gv579:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  19m   default-scheduler  Successfully assigned llm-d/llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8 to minikube-m02
  Normal  Pulled     19m   kubelet            Container image "quay.io/llm-d/llm-d-routing-sidecar-dev:0.0.6" already present on machine
  Normal  Created    19m   kubelet            Created container: routing-proxy
  Normal  Started    19m   kubelet            Started container routing-proxy
  Normal  Pulled     19m   kubelet            Container image "quay.io/llm-d/llm-d-dev:0.0.6" already present on machine
  Normal  Created    19m   kubelet            Created container: vllm
  Normal  Started    19m   kubelet            Started container vllm
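The routing-proxy init container keeps running as the pod's :8000 front end (--port=8000 --vllm-port=8200), so if requests to the decode pod misbehave its logs are worth a look as well:

$ kubectl logs llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8 -n llm-d -c routing-proxy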
##########################################################################
llm-d llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8 2/2 Running 0 23m
llm-d llama-3.2-3b-instruct-epp-65c87574f5-mj52w 1/1 Running 0 23m
llm-d llama-3.2-3b-instruct-prefill-7475648cc4-p4ww7 1/1 Running 0 23m
llm-d llm-d-inference-gateway-5fbd8c566-v24tz 1/1 Running 0 35m
llm-d llm-d-modelservice-6b7b65cdcb-fzmjn 1/1 Running 0 35m
llm-d llm-d-redis-master-5b6f9445c7-crz2f 0/1 Pending 0 23m
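Everything is up except llm-d-redis-master, which is stuck in Pending. The scheduler's reason (e.g. insufficient resources or an unbound PVC) should show up in the pod's events:

$ kubectl describe pod llm-d-redis-master-5b6f9445c7-crz2f -n llm-d | tail -n 20
$ kubectl get events -n llm-d --sort-by=.lastTimestamp | grep -i redis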