Troubleshoot Kubernetes

Troubleshooting Kubernetes issues beyond logs requires a systematic approach to inspecting cluster components, resources, and configurations. This guide outlines key open-source tools and techniques, excluding log analysis, for diagnosing problems related to running processes, open ports, network connections, hardware usage (CPU, memory, disk, network), and cluster-specific issues such as pod failures, networking faults, and misconfigurations. The methods focus on on-demand troubleshooting with tools that expose the cluster's current state.

1. Inspect Cluster Resources and Status

Tool/Method: kubectl
- Purpose: Native Kubernetes CLI to check the state of pods, nodes, services, deployments, and other resources.
- Techniques:
  - Check Pod Status: kubectl get pods -n <namespace> -o wide to see pod status (e.g., CrashLoopBackOff, Pending), node assignment, and IP addresses.
  - Describe Resources: kubectl describe pod <pod-name> -n <namespace> to view events, conditions, and reasons for failures (e.g., insufficient CPU/memory, image pull errors).
  - Node Health: kubectl get nodes and kubectl describe node <node-name> to check node conditions (e.g., NotReady, DiskPressure).
  - Resource Usage: kubectl top pod -n <namespace> or kubectl top node to monitor CPU and memory usage (requires Metrics Server).
- Example: kubectl describe pod my-app -n default to identify why a pod is stuck in Pending (e.g., no available nodes due to taints).
- Why It Helps: Reveals misconfigurations, resource constraints, or scheduling issues without relying on logs.
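
The pod status check above can be scripted when a namespace holds many pods. The following is a minimal sketch, assuming the default kubectl get pods column layout (NAME, READY, STATUS, ...); the function name not_running_pods is hypothetical and not part of kubectl.

```shell
# List pods whose STATUS column is neither Running nor Completed.
# Usage: not_running_pods <namespace>
not_running_pods() {
  ns="${1:-default}"
  # With --no-headers the columns are: NAME READY STATUS RESTARTS AGE
  kubectl get pods -n "$ns" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Each name printed is a candidate for kubectl describe pod. A pod that is Running but not Ready is not flagged by this filter, so the READY column still deserves a look.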

Tool: Lens (Open Source)
- Purpose: Graphical interface for Kubernetes to visualize cluster resources, metrics, and statuses.
- Techniques:
  - View pod, node, and service details in a dashboard.
  - Check resource utilization (CPU, memory) and network endpoints.
  - Inspect events and conditions for troubleshooting.
- Example: Use Lens to spot a pod stuck in ImagePullBackOff and check associated node resources.
- Installation: Download from Lens GitHub or install via package managers.
- Why It Helps: Provides a user-friendly alternative to kubectl with real-time insights.

2. Monitor Processes and Resource Usage

Tool: Kubernetes Metrics Server
- Purpose: Collects resource usage metrics (CPU, memory) for pods and nodes, enabling troubleshooting of resource-related issues.
- Techniques:
  - Run kubectl top pod -n <namespace> to identify pods consuming excessive CPU/memory.
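  - Check node load: kubectl top node to detect nodes running close to capacity (kubectl describe node lists the requests and limits already allocated).
  - Look for resource limits/requests mismatches in pod specs (kubectl get pod <pod-name> -o yaml).
- Example: If a container is terminated with OOMKilled, compare kubectl top pod <pod-name> --containers against its memory limit to see how close usage runs to the limit. Evictions, by contrast, stem from node-level pressure; check kubectl describe node for MemoryPressure or DiskPressure.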
- Installation: Deploy via kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml.
- Why It Helps: Pinpoints resource bottlenecks causing pod failures or performance issues.
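
To compare live usage with the configured requests and limits in one step, the two kubectl calls above can be combined. This is a rough sketch, assuming Metrics Server is already deployed; the function name usage_vs_limits is hypothetical.

```shell
# Print current per-container usage next to the configured resources.
# Usage: usage_vs_limits <pod-name> [namespace]
usage_vs_limits() {
  pod="$1"; ns="${2:-default}"
  echo "== current usage (needs Metrics Server) =="
  kubectl top pod "$pod" -n "$ns" --containers
  echo "== configured requests and limits =="
  # Prints one line per container: name, then its resources block as JSON.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
}
```

A container whose memory usage sits just under its limit is a likely OOMKilled candidate; one with no limits at all can starve its neighbours on the node.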

Tool: Prometheus with Node Exporter
- Purpose: Collects detailed system-level metrics (CPU, memory, disk, network) for Kubernetes nodes and pods.
- Techniques:
  - Query Prometheus for node metrics (e.g., node_cpu_seconds_total) to detect high CPU usage.
  - Check pod metrics via kube-state-metrics (e.g., pod restarts, pending pods).
  - Use Grafana dashboards to visualize resource trends and spot anomalies.
- On-Demand Use: Run Prometheus queries via its web UI (http://<prometheus>:9090) to diagnose issues.
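- Example: Query sum by (pod) (rate(container_cpu_usage_seconds_total[5m])) to find pods with high CPU usage.
- Installation: Deploy via Helm (helm repo add prometheus-community https://prometheus-community.github.io/helm-charts, then helm install prometheus prometheus-community/kube-prometheus-stack) or manifests.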
- Why It Helps: Provides granular metrics for troubleshooting hardware-related issues in the cluster.
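
Besides the web UI, the same queries can be run from a terminal through the Prometheus HTTP API. The sketch below assumes Prometheus is reachable at the URL you pass in (for instance through kubectl port-forward); the function name prom_query and the localhost URL in the comment are illustrative assumptions.

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# Usage: prom_query <prometheus-base-url> <promql-expression>
prom_query() {
  prom_url="$1"; promql="$2"
  # -G sends the url-encoded data as a GET query string.
  curl -sG "$prom_url/api/v1/query" --data-urlencode "query=$promql"
}

# Hypothetical example: the five pods with the highest CPU usage over 5 minutes.
# prom_query http://localhost:9090 \
#   'topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))'
```

The response is JSON, so it can be piped to a JSON processor or saved alongside other diagnostic output.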

3. Diagnose Network Issues (Ports and Connections)

Tool: kubectl for Network Inspection
- Purpose: Checks service endpoints, DNS resolution, and network policies.
- Techniques:
  - Verify Services: kubectl get svc -n <namespace> and kubectl describe svc <service-name> to ensure correct port mappings and endpoints.
  - Check Endpoints: kubectl get endpoints <service-name> -n <namespace> to confirm pods are registered.
  - Test Connectivity: Use kubectl exec to run network tests inside a pod (e.g., kubectl exec -it <pod-name> -n <namespace> -- curl <service-name>).
  - Network Policies: kubectl get networkpolicy -n <namespace> to check if policies block traffic.
- Example: If a service is unreachable, use kubectl describe svc my-service to check for missing endpoints.
- Why It Helps: Identifies misconfigured services, DNS issues, or blocked connections.
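
A Service with no endpoints is one of the most common causes of an unreachable service, so the endpoint check above is worth wrapping in a small helper. This is a sketch under the assumption that the cluster still serves the Endpoints API used above; service_has_endpoints is a hypothetical name.

```shell
# Succeed if the Service has at least one ready endpoint address.
# Usage: service_has_endpoints <service-name> [namespace]
service_has_endpoints() {
  svc="$1"; ns="${2:-default}"
  addrs="$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')"
  if [ -n "$addrs" ]; then
    echo "endpoints: $addrs"
  else
    echo "no ready endpoints for $svc in $ns" >&2
    return 1
  fi
}
```

When the helper fails, comparing the Service selector with the pod labels (kubectl get pods --show-labels) usually shows whether the pods are missing, unready, or simply labelled differently.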

Tool: k9s (Open Source)
- Purpose: Terminal-based UI for Kubernetes to inspect network resources and troubleshoot connectivity.
- Techniques:
  - View services, ingress, and network policies interactively.
  - Check pod-to-pod communication by inspecting pod IPs and ports.
  - Monitor events for network-related errors (e.g., ingress misconfiguration).
- Example: Use k9s to navigate to a service and verify its cluster IP and port configuration.
- Installation: Download from k9s GitHub or install via package managers.
- Why It Helps: Simplifies network troubleshooting with an interactive interface.

Tool: Cilium CLI (Open Source, for Cilium CNI)
- Purpose: Troubleshoots network connectivity in clusters using Cilium as the CNI.
- Techniques:
  - Run cilium status to check CNI health and Hubble observability.
  - Use cilium connectivity test to validate pod-to-pod and external connectivity.
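  - Inspect the policies loaded by an agent with cilium policy get, run inside a Cilium agent pod rather than through the standalone CLI (e.g., kubectl -n kube-system exec ds/cilium -- cilium policy get; recent releases name the in-pod binary cilium-dbg).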
- Example: cilium connectivity test to diagnose why pods cannot reach an external service.
- Installation: Install via curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz && tar -C /usr/local/bin -xzvf cilium-linux-amd64.tar.gz.
- Why It Helps: Provides deep network diagnostics for clusters using Cilium.

4. Check Open Ports and Services

Tool: kubectl port-forward and netstat/ss
- Purpose: Verifies open ports and listening services within pods.
- Techniques:
  - Use kubectl port-forward <pod-name> <local-port>:<pod-port> to access a pod’s service locally and test connectivity.
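  - Exec into a pod and run netstat -tuln or ss -tuln to list open ports (e.g., kubectl exec -it <pod-name> -- netstat -tuln). Minimal images often ship neither tool; in that case use a debug container instead.
  - Check if expected ports are open and bound to the correct process (add -p to netstat or ss to show the owning process, which may require root inside the container).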
- Example: kubectl exec -it my-app -- ss -tuln to confirm port 8080 is listening.
- Why It Helps: Identifies port misconfigurations or conflicts causing service failures.
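
The listening-port check above can be turned into a yes/no test. The sketch below assumes the container image includes ss and that ss -tuln prints its usual columns (Netid, State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port); pod_listens_on is a hypothetical helper name.

```shell
# Succeed if something inside the pod listens on the given port.
# Usage: pod_listens_on <pod-name> <port> [namespace]
pod_listens_on() {
  pod="$1"; port="$2"; ns="${3:-default}"
  # Column 5 is "Local Address:Port"; match lines whose value ends in ":<port>".
  kubectl exec "$pod" -n "$ns" -- ss -tuln \
    | awk -v p=":$port" '
        NR > 1 && substr($5, length($5) - length(p) + 1) == p { found = 1 }
        END { exit !found }'
}
```

For example, pod_listens_on my-app 8080 && echo listening reports whether port 8080 is bound, without reading the full socket table by eye.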

Tool: nmap (Open Source)
- Purpose: Scans for open ports on pod IPs or nodes to diagnose connectivity issues.
- Techniques:
  - Run nmap <pod-ip> from a pod or node to check open ports.
  - Use kubectl exec to run nmap inside a debugging pod (e.g., kubectl run debug --image=nicolaka/netshoot -- sleep infinity).
- Example: kubectl exec -it debug -- nmap 10.244.0.5 to scan a pod’s open ports.
- Installation: Install nmap in a debug container (e.g., apk add nmap in Alpine-based images).
- Why It Helps: Confirms whether services are accessible on expected ports.
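
The scan steps above can be chained so that the debug pod is created only when needed. This is a sketch, assuming the pod name debug and the nicolaka/netshoot image used above; scan_pod_ports is a hypothetical helper name.

```shell
# Scan specific ports on a pod IP from a netshoot debug pod.
# Usage: scan_pod_ports <pod-ip> <port-list>   e.g. scan_pod_ports 10.244.0.5 80,8080
scan_pod_ports() {
  target_ip="$1"; ports="$2"
  # Start the debug pod only if it does not exist yet.
  kubectl get pod debug >/dev/null 2>&1 \
    || kubectl run debug --image=nicolaka/netshoot -- sleep infinity
  kubectl wait --for=condition=Ready pod/debug --timeout=60s
  # -Pn skips host discovery so the scan proceeds even without a ping reply.
  kubectl exec debug -- nmap -Pn -p "$ports" "$target_ip"
}
```

Remove the pod afterwards with kubectl delete pod debug so it does not linger in the namespace.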

5. Troubleshoot Hardware and Node Issues

Tool: kubectl for Node Diagnostics
- Purpose: Identifies node-level issues like resource exhaustion or taints.
- Techniques:
  - Check node conditions: kubectl describe node <node-name> to spot issues like MemoryPressure or NetworkUnavailable.
  - Verify taints/tolerations: kubectl get node -o yaml to ensure pods can schedule on nodes.
  - Inspect node resources: kubectl get node -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory.
- Example: kubectl describe node node-1 to diagnose why pods aren’t scheduling (e.g., taint NoSchedule).
- Why It Helps: Pinpoints node-level constraints affecting pod deployment.
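
Before describing nodes one by one, it helps to shortlist the ones that need attention. The sketch below assumes the default kubectl get nodes column layout (NAME, STATUS, ROLES, AGE, VERSION); both function names are hypothetical.

```shell
# Print nodes whose STATUS is anything other than plain "Ready"
# (for example NotReady or Ready,SchedulingDisabled).
nodes_needing_attention() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'
}

# Show the taint keys set on each node, one node per row.
node_taints() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}
```

Nodes reported by nodes_needing_attention are the first candidates for kubectl describe node, and node_taints shows at a glance which nodes reject pods that lack a matching toleration.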

Tool: Sysdig (Open Source)
- Purpose: Provides system-level insights into Kubernetes nodes (processes, CPU, memory, disk, network).
- Techniques:
  - Run sysdig chisels on the node, such as sysdig -c topprocs_cpu or sysdig -c topprocs_net, to monitor processes and resource usage.
  - Filter events by Kubernetes metadata (e.g., sysdig k8s.pod.name=my-app).
  - Identify high disk I/O or network bottlenecks affecting pods.
- Example: sysdig -c topprocs_cpu to find processes consuming CPU on a node.
- Installation: apt install sysdig or use Sysdig’s Kubernetes integration.
- Why It Helps: Offers low-level diagnostics for node performance issues.

6. Debug Pod and Container Issues

Tool: kubectl debug
- Purpose: Provides an interactive way to troubleshoot pods or nodes.
- Techniques:
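  - Debug a pod: kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name> to attach an ephemeral container for inspection. The --image flag is required, and the container is not privileged unless a debugging profile (e.g., --profile=sysadmin on recent kubectl versions) requests it.
  - Debug a node: kubectl debug node/<node-name> -it --image=busybox to start a pod on the node with the host filesystem mounted at /host.
  - Check container runtime issues (e.g., missing dependencies, permissions).
- Example: kubectl debug my-app -it --copy-to=my-app-debug --container=<container-name> -- sh to start a copy of a pod stuck in CrashLoopBackOff with its command replaced by a shell.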
- Why It Helps: Allows direct access to pod or node environments for real-time diagnostics.
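
The two debugging modes can be wrapped in small helpers so the flags do not have to be remembered under pressure. This is a sketch, assuming a cluster that supports ephemeral containers and a busybox:1.36 image that can be pulled; debug_pod and debug_node are hypothetical names.

```shell
# Attach an interactive ephemeral container to a running pod,
# sharing the process namespace of the named container.
# Usage: debug_pod <pod-name> <container-name> [namespace]
debug_pod() {
  pod="$1"; container="$2"; ns="${3:-default}"
  kubectl debug -it "$pod" -n "$ns" \
    --image=busybox:1.36 --target="$container" -- sh
}

# Start an interactive debugging pod on a node.
# The node's root filesystem is mounted at /host inside that pod.
# Usage: debug_node <node-name>
debug_node() {
  kubectl debug "node/$1" -it --image=busybox:1.36
}
```

The pod created by debug_node is not removed automatically, so delete it once the session ends.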

Tool: netshoot (Open Source)
- Purpose: Debugging container with networking and system tools (e.g., curl, dig, tcpdump).
- Techniques:
  - Deploy a netshoot pod: kubectl run debug --image=nicolaka/netshoot -- sleep infinity.
  - Exec into netshoot to run tools like tcpdump or nmap for network diagnostics.
  - Test DNS: dig <service-name>.<namespace>.svc.cluster.local.
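- Example: kubectl exec -it debug -- tcpdump -i eth0 to capture traffic on the debug pod's own interface. To capture another pod's traffic, attach netshoot to that pod as an ephemeral container with kubectl debug.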
- Installation: Use the nicolaka/netshoot Docker image.
- Why It Helps: Provides a toolbox for in-depth pod-level troubleshooting.
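
The DNS test above can be scripted so that a failed lookup produces a non-zero exit status. This sketch assumes a running netshoot pod named debug, as created above, and the default cluster.local cluster domain; resolve_service is a hypothetical name.

```shell
# Resolve a Service name from inside the cluster via the debug pod.
# Usage: resolve_service <service-name> [namespace]
resolve_service() {
  svc="$1"; ns="${2:-default}"
  fqdn="$svc.$ns.svc.cluster.local"
  # dig +short prints only the answer records, or nothing if there are none.
  ips="$(kubectl exec debug -- dig +short "$fqdn")"
  if [ -n "$ips" ]; then
    echo "$fqdn -> $ips"
  else
    echo "$fqdn did not resolve" >&2
    return 1
  fi
}
```

If the name does not resolve, check that the Service exists in that namespace and that the cluster DNS pods are healthy.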

7. Validate Configurations

Tool: Kubeval (Open Source)
- Purpose: Validates Kubernetes YAML manifests for correctness.
- Techniques:
  - Run kubeval <manifest.yaml> to check for invalid fields or deprecated APIs.
  - Ensure resource limits, labels, or selectors are correctly defined.
- Example: kubeval deployment.yaml to catch a typo in a pod spec.
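- Installation: Download from Kubeval GitHub. Note that the project's README marks it as no longer maintained and points to kubeconform as a replacement.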
- Why It Helps: Prevents deployment failures due to configuration errors.

Tool: Kube-score (Open Source)
- Purpose: Analyzes Kubernetes manifests for best practices and potential issues.
- Techniques:
  - Run kube-score score <manifest.yaml> to get recommendations (e.g., missing resource limits, pod anti-affinity).
  - Identify security or performance issues before deployment.
- Example: kube-score score my-app.yaml to find a pod without a readiness probe.
- Installation: Install via go install github.com/zegl/kube-score/cmd/kube-score@latest.
- Why It Helps: Catches configuration issues that cause runtime problems.
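
Both validators can be run over a set of manifests in one pass, for example before a deployment. The following is a minimal sketch, assuming kubeval and kube-score are installed and on the PATH; check_manifests is a hypothetical helper name.

```shell
# Run kubeval and kube-score on each manifest and count failing checks.
# Usage: check_manifests <manifest.yaml>...
check_manifests() {
  failed=0
  for manifest in "$@"; do
    echo "== $manifest =="
    kubeval "$manifest" || failed=$((failed + 1))
    kube-score score "$manifest" || failed=$((failed + 1))
  done
  echo "$failed check(s) reported problems"
  # Return non-zero if any tool reported a problem.
  [ "$failed" -eq 0 ]
}
```

Because the function returns a non-zero status when any tool complains, it can gate a deployment script: check_manifests deployment.yaml && kubectl apply -f deployment.yaml.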

Recommendations

- Quick Diagnostics: Use kubectl describe and kubectl top for immediate insights into pod and node issues.
- Network Troubleshooting: Combine kubectl port-forward, netshoot, and Cilium CLI (if using Cilium) to diagnose connectivity.
- Resource Issues: Leverage Metrics Server or Prometheus for CPU/memory bottlenecks.
- Interactive Debugging: Use k9s or Lens for a visual, real-time troubleshooting experience.
- Configuration Checks: Run kubeval and kube-score before deploying to avoid misconfigurations.

Notes

- Prerequisites: Ensure kubectl is configured with cluster access. Metrics Server or Prometheus must be deployed beforehand to provide resource metrics.
- Installation: Most tools are available via package managers, Helm, or direct downloads from their GitHub repositories.
- On-Demand Use: Tools like kubectl, k9s, and netshoot are ideal for ad-hoc troubleshooting without persistent setup.
- Reporting: Export kubectl outputs (e.g., kubectl get pods -o json > report.json) or use Prometheus/Grafana for visual reports.

erhangundogan/troubleshoot-kubernetes.md