$ k logs llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8 -c vllm
INFO 05-03 16:00:29 [__init__.py:239] Automatically detected platform cuda.
INFO 05-03 16:00:32 [api_server.py:1042] vLLM API server version 0.1.dev1+g9b70e2b
INFO 05-03 16:00:32 [api_server.py:1043] args: Namespace(host=None, port=8200, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.2-3B-Instruct', task='auto', tokenizer=None, tokenizer_mode='auto', trust_remote_code=False, dtype='auto', seed=None, hf_config_path=None, allowed_local_media_path='', revision=None, code_revision=None, rope_scaling={}, rope_theta=None, tokenizer_revision=None, max_model_len=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, max_logprobs=20, disable_sliding_window=False, disable_cascade_attn=False, skip_tokenizer_init=False, served_model_name=None, disable_async_output_proc=False, config_format='auto', hf_token=None, hf_overrides={}, override_neuron_config={}, override_pooler_config=None, logits_processor_pattern=None, generation_config='auto', override_generation_config={}, enable_sleep_mode=False, model_impl='auto', load_format='auto', download_dir=None, model_loader_extra_config={}, ignore_patterns=None, use_tqdm_on_load=True, qlora_adapter_name_or_path=None, guided_decoding_backend='auto', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, enable_reasoning=None, reasoning_parser='', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, worker_cls='auto', worker_extension_cls='', block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, cuda_graph_sizes=[512], long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='LMCacheConnectorV1', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_consumer', kv_rank=None, kv_parallel_size=1, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={'discard_partial_chunks': False, 'lmcache_rpc_port': 'consumer1'}), kv_events_config=None, additional_config=None, use_v2_block_manager=True, disable_log_stats=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 05-03 16:00:39 [config.py:751] This model supports multiple tasks: {'classify', 'reward', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 05-03 16:00:39 [config.py:2047] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-03 16:00:45 [__init__.py:239] Automatically detected platform cuda.
INFO 05-03 16:00:48 [core.py:59] Initializing a V1 LLM engine (v0.1.dev1+g9b70e2b) with config: model='meta-llama/Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=meta-llama/Llama-3.2-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-03 16:00:48 [utils.py:2550] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x70912878d520>
INFO 05-03 16:00:49 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-03 16:00:49 [factory.py:64] Creating v1 connector with name: LMCacheConnectorV1
WARNING 05-03 16:00:49 [base.py:58] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
[2025-05-03 16:00:49,437] LMCache INFO: Loading LMCache config file /vllm-workspace/lmcache-decoder-config.yaml (utils.py:42:lmcache.integration.vllm.utils)
[2025-05-03 16:00:49,438] LMCache INFO: LMCache Configuration: {'chunk_size': 256, 'local_cpu': False, 'max_local_cpu_size': '0 GB', 'local_disk': None, 'max_local_disk_size': '0 GB', 'remote_url': None, 'remote_serde': None, 'save_decode_cache': False, 'enable_blending': False, 'blend_recompute_ratio': 0.15, 'blend_min_tokens': 256, 'enable_p2p': False, 'lookup_url': None, 'distributed_url': None, 'error_handling': False, 'enable_controller': False, 'lmcache_instance_id': 'lmcache_default_instance', 'enable_nixl': True, 'nixl_role': 'receiver', 'nixl_peer_host': '0.0.0.0', 'nixl_peer_port': 55555, 'nixl_buffer_size': 524288, 'nixl_buffer_device': 'cuda', 'nixl_enable_gc': True} (config.py:452:lmcache.experimental.config)
[2025-05-03 16:00:49,439] LMCache INFO: Creating LMCacheEngine instance vllm-instance (cache_engine.py:467:lmcache.experimental.cache_engine)
[2025-05-03 16:00:49,440] LMCache INFO: Creating LMCacheEngine with config: LMCacheEngineConfig(chunk_size=256, local_cpu=False, max_local_cpu_size=0, local_disk=None, max_local_disk_size=0, remote_url=None, remote_serde=None, save_decode_cache=False, enable_blending=False, blend_recompute_ratio=0.15, blend_min_tokens=256, blend_special_str=' # # ', enable_p2p=False, lookup_url=None, distributed_url=None, error_handling=False, enable_controller=False, lmcache_instance_id='lmcache_default_instance', controller_url=None, lmcache_worker_url=None, enable_nixl=True, nixl_role='receiver', nixl_peer_host='0.0.0.0', nixl_peer_port=55555, nixl_buffer_size=524288, nixl_buffer_device='cuda', nixl_enable_gc=True) (cache_engine.py:73:lmcache.experimental.cache_engine)
Failed to load plugin from /usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_UCX_MO.so: libplugin_UCX.so: cannot open shared object file: No such file or directory
Failed to load plugin 'UCX_MO' from any directory
Loaded plugin GDS
Loaded plugin UCX
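
For reference, the decoder's LMCache settings above are read from /vllm-workspace/lmcache-decoder-config.yaml, which is mounted from the config-decoder ConfigMap shown in the decode pod description below. A partial sketch of what that YAML likely contains, reconstructed from the "LMCache Configuration" line in the log; the key names are assumed to follow LMCache's YAML config format:

# sketch only -- values copied from the logged configuration, key names assumed
chunk_size: 256
local_cpu: False
enable_nixl: True
nixl_role: "receiver"
nixl_peer_host: "0.0.0.0"
nixl_peer_port: 55555
nixl_buffer_size: 524288
nixl_buffer_device: "cuda"
nixl_enable_gc: True

The prefill pod points LMCACHE_CONFIG_FILE at a parallel file, lmcache-prefiller-config.yaml, which would carry the sending-side nixl_role.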
##########################################################################
root 1 0 4 16:00 ? 00:00:15 python3 -m vllm.entrypoints.openai.api_server --port 8200 --kv-transfer-config {"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer1"}} --model meta-llama/Llama-3.2-3B-Instruct
root 119 1 0 16:00 ? 00:00:00 /workspace/vllm/.vllm/bin/python3 -c from multiprocessing.resource_tracker import main;main(36)
root 120 1 99 16:00 ? 00:05:26 /workspace/vllm/.vllm/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=37, pipe_handle=39) --multiprocessing-fork
root 204 0 0 16:01 pts/0 00:00:00 /bin/bash
root 215 204 0 16:06 pts/0 00:00:00 ps -eaf
# ss -tlnp | grep :8000
LISTEN 0 4096 *:8000 *:*
##########################################################################
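
The :8000 listener above belongs to the routing-proxy sidecar (described further down); the vLLM decode server itself was started with --port 8200. A quick, hypothetical way to confirm both listeners and the OpenAI-compatible API from inside the vllm container (assumes ss and curl are present in the image):

ss -tlnp | grep -E ':8000|:8200'
curl -s http://localhost:8200/v1/models    # vLLM's standard OpenAI-compatible model listing
curl -s http://localhost:8000/v1/models    # same request via the routing-proxy on :8000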
$ kubectl describe pod llama-3.2-3b-instruct-prefill-7475648cc4-p4ww7 -n llm-d | \
    sed -n '/Containers:/,/Volumes:/p'
Containers:
  vllm:
    Container ID:  docker://f2a7ba72ea8a82d0474aa74b21140488a90d7c29182e527c9d965578c01ee558
    Image:         quay.io/llm-d/llm-d-dev:0.0.6
    Image ID:      docker-pullable://quay.io/llm-d/llm-d-dev@sha256:281e7ee67c8993d3f3f69ac27030fca3735be083056dd877b71861153d8da1e4
    Port:          8000/TCP
    Host Port:     0/TCP
    Args:
      --port
      8000
      --kv-transfer-config
      {"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer1"}}
      --model
      meta-llama/Llama-3.2-3B-Instruct
    State:          Running
      Started:      Sat, 03 May 2025 16:00:24 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      cpu:             16
      memory:          16Gi
      nvidia.com/gpu:  1
    Environment:
      CUDA_VISIBLE_DEVICES:            0
      UCX_TLS:                         cuda_ipc,cuda_copy,tcp
      LMCACHE_CONFIG_FILE:             /vllm-workspace/lmcache-prefiller-config.yaml
      LMCACHE_USE_EXPERIMENTAL:        True
      VLLM_ENABLE_V1_MULTIPROCESSING:  1
      VLLM_WORKER_MULTIPROC_METHOD:    spawn
      HF_HUB_CACHE:                    /vllm-workspace/models
      HF_TOKEN:                        <set to the key 'HF_TOKEN' in secret 'llm-d-hf-token'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zm684 (ro)
      /vllm-workspace from config-prefiller (rw)
      /vllm-workspace/models from model-cache (rw)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
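
The prefill pod is the kv_producer half of the pair (see the --kv-transfer-config above) and points LMCACHE_CONFIG_FILE at lmcache-prefiller-config.yaml. To inspect that file and the ConfigMap backing the config-prefiller mount, something like the following illustrative commands should work:

kubectl get pod llama-3.2-3b-instruct-prefill-7475648cc4-p4ww7 -n llm-d \
  -o jsonpath='{.spec.volumes[?(@.name=="config-prefiller")].configMap.name}'
kubectl exec -n llm-d llama-3.2-3b-instruct-prefill-7475648cc4-p4ww7 -c vllm -- \
  cat /vllm-workspace/lmcache-prefiller-config.yaml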
$ kubectl describe pod llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8
Name:             llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8
Namespace:        llm-d
Priority:         0
Service Account:  default
Node:             minikube-m02/192.168.49.3
Start Time:       Sat, 03 May 2025 16:00:20 +0000
Labels:           llm-d.ai/inferenceServing=true
                  llm-d.ai/model=llama-3.2-3b-instruct
                  llm-d.ai/role=decode
                  pod-template-hash=6dcb767b75
Annotations:      <none>
Status:           Running
IP:               10.244.1.12
IPs:
  IP:  10.244.1.12
Controlled By:  ReplicaSet/llama-3.2-3b-instruct-decode-6dcb767b75
Init Containers:
  routing-proxy:
    Container ID:  docker://bdda09ea4a2dc7e26624573754475687b208263d997a73596c726394679048ca
    Image:         quay.io/llm-d/llm-d-routing-sidecar-dev:0.0.6
    Image ID:      docker-pullable://quay.io/llm-d/llm-d-routing-sidecar-dev@sha256:4243179e3b0d33fbf9168c9b9296b1893776e91c584b2ec0c0a44fcaad5928d5
    Port:          8000/TCP
    Host Port:     0/TCP
    Args:
      --port=8000
      --vllm-port=8200
    State:          Running
      Started:      Sat, 03 May 2025 16:00:21 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gv579 (ro)
Containers:
  vllm:
    Container ID:  docker://9c316ac1dd21aca19c66e2f6987ed26c2e74f1b45fc30063d7c8875dac347e50
    Image:         quay.io/llm-d/llm-d-dev:0.0.6
    Image ID:      docker-pullable://quay.io/llm-d/llm-d-dev@sha256:281e7ee67c8993d3f3f69ac27030fca3735be083056dd877b71861153d8da1e4
    Port:          55555/TCP
    Host Port:     0/TCP
    Args:
      --port
      8200
      --kv-transfer-config
      {"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer1"}}
      --model
      meta-llama/Llama-3.2-3B-Instruct
    State:          Running
      Started:      Sat, 03 May 2025 16:00:21 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      cpu:             16
      memory:          16Gi
      nvidia.com/gpu:  1
    Environment:
      CUDA_VISIBLE_DEVICES:            0
      UCX_TLS:                         cuda_ipc,cuda_copy,tcp
      LMCACHE_CONFIG_FILE:             /vllm-workspace/lmcache-decoder-config.yaml
      LMCACHE_USE_EXPERIMENTAL:        True
      VLLM_ENABLE_V1_MULTIPROCESSING:  1
      VLLM_WORKER_MULTIPROC_METHOD:    spawn
      HF_HUB_CACHE:                    /vllm-workspace/models
      HF_TOKEN:                        <set to the key 'HF_TOKEN' in secret 'llm-d-hf-token'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gv579 (ro)
      /vllm-workspace from config-decoder (rw)
      /vllm-workspace/models from model-cache (rw)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  config-decoder:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      llm-d-modelservice-config-decoder
    Optional:  false
  model-cache:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  1Gi
  model-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  llama-3.2-3b-instruct-pvc
    ReadOnly:   true
  kube-api-access-gv579:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    Optional:                 false
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  19m   default-scheduler  Successfully assigned llm-d/llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8 to minikube-m02
  Normal  Pulled     19m   kubelet            Container image "quay.io/llm-d/llm-d-routing-sidecar-dev:0.0.6" already present on machine
  Normal  Created    19m   kubelet            Created container: routing-proxy
  Normal  Started    19m   kubelet            Started container routing-proxy
  Normal  Pulled     19m   kubelet            Container image "quay.io/llm-d/llm-d-dev:0.0.6" already present on machine
  Normal  Created    19m   kubelet            Created container: vllm
  Normal  Started    19m   kubelet            Started container vllm
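
The decode pod layers two listeners: the routing-proxy runs as an init container that stays up and fronts the pod on :8000, forwarding to vLLM on :8200, while the vllm container's declared port 55555/TCP lines up with the nixl_peer_port that the LMCache receiver binds. A hypothetical smoke test of the sidecar-to-vLLM path (model name taken from the --model arg above; the request body is illustrative):

kubectl port-forward -n llm-d pod/llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8 8000:8000 &
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"meta-llama/Llama-3.2-3B-Instruct","prompt":"Hello","max_tokens":16}'

In a full llm-d flow, requests would normally enter through the llm-d-inference-gateway (listed below) rather than the pod directly; this only exercises the proxy and the decode server.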
##########################################################################
llm-d   llama-3.2-3b-instruct-decode-6dcb767b75-4c8c8    2/2   Running   0   23m
llm-d   llama-3.2-3b-instruct-epp-65c87574f5-mj52w       1/1   Running   0   23m
llm-d   llama-3.2-3b-instruct-prefill-7475648cc4-p4ww7   1/1   Running   0   23m
llm-d   llm-d-inference-gateway-5fbd8c566-v24tz          1/1   Running   0   35m
llm-d   llm-d-modelservice-6b7b65cdcb-fzmjn              1/1   Running   0   35m
llm-d   llm-d-redis-master-5b6f9445c7-crz2f              0/1   Pending   0   23m
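
Everything is up except llm-d-redis-master, which is stuck in Pending. Hypothetical first steps to see why (often an unbound PVC or no schedulable node in a minikube setup, but the pod's events will say):

kubectl describe pod llm-d-redis-master-5b6f9445c7-crz2f -n llm-d | sed -n '/Events:/,$p'
kubectl get pvc -n llm-d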