Troubleshooting Kubernetes issues beyond logs requires a systematic approach to inspect cluster components, resources, and configurations. Below, I outline key open-source tools and techniques (excluding log analysis, as you specified) to diagnose problems in Kubernetes related to running processes, open ports, network connections, hardware usage (CPU, memory, disk, network), and cluster-specific issues like pod failures, networking, or misconfigurations. These methods focus on on-demand troubleshooting and leverage tools that provide insights into the cluster's state.
- Tool/Method: `kubectl`
  - Purpose: Native Kubernetes CLI to check the state of pods, nodes, services, deployments, and other resources.
  - Techniques:
    - Check Pod Status: `kubectl get pods -n <namespace> -o wide` to see pod status (e.g., `CrashLoopBackOff`, `Pending`), node assignment, and IP addresses.
    - Describe Resources: `kubectl describe pod <pod-name> -n <namespace>` to view events, conditions, and reasons for failures (e.g., insufficient CPU/memory, image pull errors).
    - Node Health: `kubectl get nodes` and `kubectl describe node <node-name>` to check node conditions (e.g., `NotReady`, `DiskPressure`).
    - Resource Usage: `kubectl top pod -n <namespace>` or `kubectl top node` to monitor CPU and memory usage (requires Metrics Server).
  - Example: `kubectl describe pod my-app -n default` to identify why a pod is stuck in `Pending` (e.g., no available nodes due to taints).
  - Why It Helps: Reveals misconfigurations, resource constraints, or scheduling issues without relying on logs.
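As a quick way to apply the pod status check above, here is a minimal sketch that filters `kubectl get pods` output down to pods that are not healthy. The function name `unhealthy_pods` is hypothetical, and it assumes the default column layout (NAME, READY, STATUS, ...) without a leading namespace column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints NAME and STATUS for every pod that is neither Running nor Completed.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE ...
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Typical use against a live cluster (namespace is a placeholder):
#   kubectl get pods -n <namespace> --no-headers | unhealthy_pods
```

Each pod it prints is a candidate for `kubectl describe pod`. With `-A`/`--all-namespaces` the columns shift by one, so the field numbers would need adjusting.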
- Tool: Lens (Open Source)
  - Purpose: Graphical interface for Kubernetes to visualize cluster resources, metrics, and statuses.
  - Techniques:
    - View pod, node, and service details in a dashboard.
    - Check resource utilization (CPU, memory) and network endpoints.
    - Inspect events and conditions for troubleshooting.
  - Example: Use Lens to spot a pod stuck in `ImagePullBackOff` and check associated node resources.
  - Installation: Download from Lens GitHub or install via package managers.
  - Why It Helps: Provides a user-friendly alternative to `kubectl` with real-time insights.
- Tool: Kubernetes Metrics Server
  - Purpose: Collects resource usage metrics (CPU, memory) for pods and nodes, enabling troubleshooting of resource-related issues.
  - Techniques:
    - Run `kubectl top pod -n <namespace>` to identify pods consuming excessive CPU/memory.
    - Check node usage: `kubectl top node` to detect nodes running close to capacity. For requests versus allocatable capacity, see the "Allocated resources" section of `kubectl describe node <node-name>`.
    - Compare actual usage with the resource requests/limits in pod specs (`kubectl get pod <pod-name> -n <namespace> -o yaml`).
  - Example: If a pod keeps getting OOM-killed or evicted, use `kubectl top pod` on its running replicas to see whether memory usage is approaching the configured limit.
  - Installation: Deploy via `kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml`.
  - Why It Helps: Pinpoints resource bottlenecks causing pod failures or performance issues.
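To make the "excessive memory" check above concrete, the following sketch flags pods above a chosen threshold. The helper name `pods_over_mem` and the threshold are illustrative, and it assumes `kubectl top pod` reports memory in `Mi`, which is the usual unit but not guaranteed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl top pod --no-headers` output on stdin
# (columns: NAME CPU(cores) MEMORY(bytes)) and prints pods whose memory
# usage exceeds the given threshold in Mi. Rows not expressed in Mi are skipped.
pods_over_mem() {
  local limit_mi="$1"
  awk -v limit="$limit_mi" '
    $3 ~ /Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)            # strip the unit suffix
      if (mem + 0 > limit + 0) print $1, $3
    }'
}

# Typical use (namespace and threshold are placeholders):
#   kubectl top pod -n <namespace> --no-headers | pods_over_mem 500
```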
- Tool: Prometheus with Node Exporter
  - Purpose: Collects detailed system-level metrics (CPU, memory, disk, network) for Kubernetes nodes and pods.
  - Techniques:
    - Query Prometheus for node metrics (e.g., `node_cpu_seconds_total`) to detect high CPU usage.
    - Check pod metrics via kube-state-metrics (e.g., pod restarts, pending pods).
    - Use Grafana dashboards to visualize resource trends and spot anomalies.
  - On-Demand Use: Run Prometheus queries via its web UI (`http://<prometheus>:9090`) to diagnose issues.
  - Example: Query `rate(container_cpu_usage_seconds_total[5m])` to find pods with high CPU usage.
  - Installation: Deploy via Helm (`helm repo add prometheus-community https://prometheus-community.github.io/helm-charts`, then `helm install prometheus prometheus-community/kube-prometheus-stack`) or manifests.
  - Why It Helps: Provides granular metrics for troubleshooting hardware-related issues in the cluster.
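The same on-demand queries can be run from a terminal through Prometheus's HTTP API (`/api/v1/query`). This is a minimal sketch: the wrapper name `prom_query` is hypothetical, and the base URL is whatever address your Prometheus is reachable at (for example through `kubectl port-forward`).

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: sends an instant PromQL query to the Prometheus
# HTTP API and prints the raw JSON response.
#   $1 = base URL (e.g. http://localhost:9090), $2 = PromQL expression
prom_query() {
  local base_url="$1" query="$2"
  # -G sends the URL-encoded data as a query string on a GET request.
  curl -sS -G --data-urlencode "query=${query}" "${base_url}/api/v1/query"
}

# Typical use (URL is a placeholder):
#   prom_query http://localhost:9090 'rate(container_cpu_usage_seconds_total[5m])'
```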
- Tool: `kubectl` for Network Inspection
  - Purpose: Checks service endpoints, DNS resolution, and network policies.
  - Techniques:
    - Verify Services: `kubectl get svc -n <namespace>` and `kubectl describe svc <service-name>` to ensure correct port mappings and endpoints.
    - Check Endpoints: `kubectl get endpoints <service-name> -n <namespace>` to confirm pods are registered.
    - Test Connectivity: Use `kubectl exec` to run network tests inside a pod (e.g., `kubectl exec -it <pod-name> -n <namespace> -- curl <service-name>`).
    - Network Policies: `kubectl get networkpolicy -n <namespace>` to check if policies block traffic.
  - Example: If a service is unreachable, use `kubectl describe svc my-service` to check for missing endpoints.
  - Why It Helps: Identifies misconfigured services, DNS issues, or blocked connections.
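The endpoint check above can be run across a whole namespace. The sketch below lists services whose endpoints column is empty. The helper name `services_without_endpoints` is hypothetical, and it relies on `kubectl` printing `<none>` for an empty endpoint list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get endpoints --no-headers` output on
# stdin (columns: NAME ENDPOINTS AGE) and prints the names of services that
# currently have no backing pod addresses.
services_without_endpoints() {
  awk '$2 == "<none>" { print $1 }'
}

# Typical use (namespace is a placeholder):
#   kubectl get endpoints -n <namespace> --no-headers | services_without_endpoints
```

A service listed here usually has a selector that matches no ready pods, so the next step is to compare the selector with the pod labels.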
- Tool: k9s (Open Source)
  - Purpose: Terminal-based UI for Kubernetes to inspect network resources and troubleshoot connectivity.
  - Techniques:
    - View services, ingress, and network policies interactively.
    - Check pod-to-pod communication by inspecting pod IPs and ports.
    - Monitor events for network-related errors (e.g., ingress misconfiguration).
  - Example: Use k9s to navigate to a service and verify its cluster IP and port configuration.
  - Installation: Download from k9s GitHub or install via package managers.
  - Why It Helps: Simplifies network troubleshooting with an interactive interface.
- Tool: Cilium CLI (Open Source, for Cilium CNI)
  - Purpose: Troubleshoots network connectivity in clusters using Cilium as the CNI.
  - Techniques:
    - Run `cilium status` to check CNI health and Hubble observability.
    - Use `cilium connectivity test` to validate pod-to-pod and external connectivity.
    - Inspect network policies with `cilium policy get`. This subcommand belongs to the agent's own CLI, so run it inside a Cilium agent pod (e.g., `kubectl -n kube-system exec ds/cilium -- cilium policy get`). Recent releases may name the agent binary `cilium-dbg`.
  - Example: `cilium connectivity test` to diagnose why pods cannot reach an external service.
  - Installation: Install via `curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz && sudo tar -C /usr/local/bin -xzvf cilium-linux-amd64.tar.gz`.
  - Why It Helps: Provides deep network diagnostics for clusters using Cilium.
- Tool: `kubectl port-forward` and `netstat`/`ss`
  - Purpose: Verifies open ports and listening services within pods.
  - Techniques:
    - Use `kubectl port-forward <pod-name> <local-port>:<pod-port>` to access a pod's service locally and test connectivity.
    - Exec into a pod and run `netstat -tuln` or `ss -tuln` to list open ports (e.g., `kubectl exec -it <pod-name> -- netstat -tuln`). This works only if the container image ships these tools.
    - Check if expected ports are open and bound to the correct process (add `-p`, as in `ss -tulnp`, to show the owning process).
  - Example: `kubectl exec -it my-app -- ss -tuln` to confirm port 8080 is listening.
  - Why It Helps: Identifies port misconfigurations or conflicts causing service failures.
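For scripted checks, the `ss -tuln` output can be tested for a specific port. This is a rough sketch: the helper name `port_is_listening` is hypothetical, and it assumes the `ss -tuln` layout, where the local address is the fifth column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ss -tuln` output on stdin and succeeds (exit 0)
# if any socket's local address ends with ":<port>".
# Assumes columns: Netid State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
port_is_listening() {
  local port="$1"
  awk -v p=":${port}" '
    {
      n = length($5) - length(p) + 1       # start index of the candidate suffix
      if (n > 0 && substr($5, n) == p) found = 1
    }
    END { exit (found ? 0 : 1) }'
}

# Typical use (pod name and port are placeholders):
#   kubectl exec my-app -- ss -tuln | port_is_listening 8080 && echo "listening"
```

Matching the suffix together with the colon keeps port 80 from matching 8080.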
- Tool: nmap (Open Source)
  - Purpose: Scans for open ports on pod IPs or nodes to diagnose connectivity issues.
  - Techniques:
    - Run `nmap <pod-ip>` from a pod or node to check open ports.
    - Start a debugging pod (e.g., `kubectl run debug --image=nicolaka/netshoot -- sleep infinity`) and use `kubectl exec` to run `nmap` inside it.
  - Example: `kubectl exec -it debug -- nmap 10.244.0.5` to scan a pod's open ports.
  - Installation: Install `nmap` in a debug container (e.g., `apk add nmap` in Alpine-based images).
  - Why It Helps: Confirms whether services are accessible on expected ports.
- Tool: `kubectl` for Node Diagnostics
  - Purpose: Identifies node-level issues like resource exhaustion or taints.
  - Techniques:
    - Check node conditions: `kubectl describe node <node-name>` to spot issues like `MemoryPressure` or `NetworkUnavailable`.
    - Verify taints/tolerations: `kubectl get node -o yaml` to ensure pods can schedule on nodes.
    - Inspect node resources: `kubectl get node -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory`.
  - Example: `kubectl describe node node-1` to diagnose why pods aren't scheduling (e.g., a taint with the `NoSchedule` effect).
  - Why It Helps: Pinpoints node-level constraints affecting pod deployment.
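Before describing nodes one by one, it can help to narrow the list to the nodes that need attention. This is a small sketch with a hypothetical helper name, `problem_nodes`, assuming the default `kubectl get nodes` columns (NAME, STATUS, ROLES, AGE, VERSION).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints NAME and STATUS for nodes whose status is anything other than a
# plain "Ready" (e.g. NotReady, or Ready,SchedulingDisabled for cordoned nodes).
problem_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Typical use:
#   kubectl get nodes --no-headers | problem_nodes
```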
- Tool: Sysdig (Open Source)
  - Purpose: Provides system-level insights into Kubernetes nodes (processes, CPU, memory, disk, network).
  - Techniques:
    - Run `sysdig` as root on the node to capture system-level activity from processes and containers. Depending on the version, Kubernetes metadata may require pointing Sysdig at the API server with `-k <api-server-url>`.
    - Filter on Kubernetes fields (e.g., `sysdig k8s.pod.name=my-app`).
    - Identify high disk I/O or network bottlenecks affecting pods.
  - Example: `sysdig -c topprocs_cpu` to find processes consuming CPU on a node.
  - Installation: `apt install sysdig` or use Sysdig's Kubernetes integration.
  - Why It Helps: Offers low-level diagnostics for node performance issues.
- Tool: `kubectl debug`
  - Purpose: Provides an interactive way to troubleshoot pods or nodes.
  - Techniques:
    - Debug a pod: `kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name>` to attach an ephemeral debug container to the running pod. The container is not privileged by default.
    - Debug a node: `kubectl debug node/<node-name> -it --image=busybox` to start a pod on the node with the host filesystem mounted at `/host`.
    - Check container runtime issues (e.g., missing dependencies, permissions).
  - Example: `kubectl debug -it my-app --image=busybox --target=<container-name>` to inspect a pod stuck in `CrashLoopBackOff`.
  - Why It Helps: Allows direct access to pod or node environments for real-time diagnostics.
- Tool: netshoot (Open Source)
  - Purpose: Debugging container with networking and system tools (e.g., `curl`, `dig`, `tcpdump`).
  - Techniques:
    - Deploy a `netshoot` pod: `kubectl run debug --image=nicolaka/netshoot -- sleep infinity`.
    - Exec into `netshoot` to run tools like `tcpdump` or `nmap` for network diagnostics.
    - Test DNS: `dig <service-name>.<namespace>.svc.cluster.local`.
  - Example: `kubectl exec -it debug -- tcpdump -i eth0` to capture traffic on the debug pod's own interface. To capture another pod's traffic, attach `netshoot` to that pod as an ephemeral container with `kubectl debug`.
  - Installation: Use the `nicolaka/netshoot` Docker image.
  - Why It Helps: Provides a toolbox for in-depth pod-level troubleshooting.
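The DNS test above depends on getting the service's fully qualified name right. The sketch below builds that name. The function name `svc_fqdn` is hypothetical, and `cluster.local` is only the common default cluster domain, so a third argument overrides it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the in-cluster DNS name of a service.
#   $1 = service name, $2 = namespace (default: default),
#   $3 = cluster domain (default: cluster.local, an assumption)
svc_fqdn() {
  local service="$1" namespace="${2:-default}" domain="${3:-cluster.local}"
  printf '%s.%s.svc.%s\n' "$service" "$namespace" "$domain"
}

# Typical use from the netshoot pod started earlier (names are placeholders):
#   kubectl exec -it debug -- dig +short "$(svc_fqdn my-service my-namespace)"
```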
- Tool: Kubeval (Open Source)
  - Purpose: Validates Kubernetes YAML manifests for correctness.
  - Techniques:
    - Run `kubeval <manifest.yaml>` to check for invalid fields or deprecated APIs.
    - Confirm that fields such as resource limits, labels, and selectors are structurally valid against the Kubernetes schema.
  - Example: `kubeval deployment.yaml` to catch a typo in a pod spec.
  - Installation: Download from Kubeval GitHub. The project is no longer actively maintained, and its README points to kubeconform as a successor.
  - Why It Helps: Prevents deployment failures due to configuration errors.
- Tool: Kube-score (Open Source)
  - Purpose: Analyzes Kubernetes manifests for best practices and potential issues.
  - Techniques:
    - Run `kube-score score <manifest.yaml>` to get recommendations (e.g., missing resource limits, pod anti-affinity).
    - Identify security or performance issues before deployment.
  - Example: `kube-score score my-app.yaml` to find a pod without a readiness probe.
  - Installation: Install via `go install github.com/zegl/kube-score/cmd/kube-score@latest`.
  - Why It Helps: Catches configuration issues that cause runtime problems.
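The two manifest checks can be combined into one pre-deployment gate. This is only a sketch: the function name `check_manifests` is hypothetical, and it assumes both `kubeval` and `kube-score` are on `PATH` and exit non-zero when they find a problem.

```shell
#!/usr/bin/env bash
# Hypothetical helper: runs kubeval and kube-score on each manifest given as
# an argument. Returns non-zero if any tool reports a failure for any file,
# but keeps going so that every problem is reported in a single pass.
check_manifests() {
  local failed=0 manifest
  for manifest in "$@"; do
    kubeval "$manifest" || failed=1
    kube-score score "$manifest" || failed=1
  done
  return "$failed"
}

# Typical use (file names are placeholders):
#   check_manifests deployment.yaml service.yaml && kubectl apply -f deployment.yaml -f service.yaml
```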
- Quick Diagnostics: Use `kubectl describe` and `kubectl top` for immediate insights into pod and node issues.
- Network Troubleshooting: Combine `kubectl port-forward`, `netshoot`, and Cilium CLI (if using Cilium) to diagnose connectivity.
- Resource Issues: Leverage Metrics Server or Prometheus for CPU/memory bottlenecks.
- Interactive Debugging: Use `k9s` or Lens for a visual, real-time troubleshooting experience.
- Configuration Checks: Run `kubeval` and `kube-score` before deploying to avoid misconfigurations.

- Prerequisites: Ensure `kubectl` is configured with cluster access. Metrics Server or Prometheus requires prior deployment for resource metrics.
- Installation: Most tools are available via package managers, Helm, or direct downloads from their GitHub repositories.
- On-Demand Use: Tools like `kubectl`, `k9s`, and `netshoot` are ideal for ad-hoc troubleshooting without persistent setup.
- Reporting: Export `kubectl` outputs (e.g., `kubectl get pods -o json > report.json`) or use Prometheus/Grafana for visual reports.