k8s-troubleshooting
Comprehensive Kubernetes and OpenShift cluster health analysis and troubleshooting based on Popeye's issue detection patterns. Use this skill for:
- Proactive cluster health assessment and security analysis
- Analyzing pod/container logs for errors or issues
- Interpreting cluster events (kubectl get events)
- Debugging pod failures: CrashLoopBackOff, ImagePullBackOff, OOMKilled, etc.
- Diagnosing networking issues: DNS, Service connectivity, Ingress/Route problems
- Investigating storage issues: PVC pending, mount failures
- Analyzing node problems: NotReady, resource pressure, taints
- Troubleshooting OCP-specific issues: SCCs, Routes, Operators, Builds
- Performance analysis and resource optimization
- Security vulnerability assessment and RBAC validation
- Configuration best practices validation
- Reliability and high availability analysis
Install
# tip: Run this command in your terminal to install the skill
git clone https://github.com/kcns008/cluster-code /tmp/cluster-code && cp -r /tmp/cluster-code/.claude/skills/k8s-troubleshooting ~/.claude/skills/cluster-code/
Kubernetes / OpenShift Troubleshooting Guide
Command Usage Convention
IMPORTANT: This skill uses kubectl as the primary command in all examples. When working with:
- OpenShift/ARO clusters: Replace all kubectl commands with oc
- Standard Kubernetes clusters (AKS, EKS, GKE, etc.): Use kubectl as shown
The agent will automatically detect the cluster type and use the appropriate command.
Systematic approach to diagnosing and resolving cluster issues through event analysis, log interpretation, and root cause identification.
Proactive Cluster Health Analysis (Popeye-Style)
Cluster Scoring Framework
Popeye uses a health scoring system (0-100) to assess cluster health. Critical issues reduce the score significantly:
- BOOM (Critical): -50 points - Security vulnerabilities, resource exhaustion, failed services
- WARN (Warning): -20 points - Configuration inefficiencies, best practice violations
- INFO (Informational): -5 points - Non-critical issues, optimization opportunities
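As a rough sketch of how these penalties combine (the counter variables below are placeholders, not part of Popeye itself), the score can be computed like this:
#!/bin/bash
# Illustrative scoring sketch: start at 100 and subtract Popeye-style penalties.
# BOOM_COUNT, WARN_COUNT and INFO_COUNT are placeholders for findings you have
# already collected (for example from the checks in the next section).
BOOM_COUNT=${BOOM_COUNT:-0}
WARN_COUNT=${WARN_COUNT:-0}
INFO_COUNT=${INFO_COUNT:-0}
SCORE=$((100 - BOOM_COUNT * 50 - WARN_COUNT * 20 - INFO_COUNT * 5))
# Clamp at zero so the result stays within the 0-100 range
[ "$SCORE" -lt 0 ] && SCORE=0
echo "Cluster health score: ${SCORE}/100"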
Quick Cluster Health Assessment
#!/bin/bash
# Comprehensive cluster health check based on Popeye patterns
echo "=== POPEYE-STYLE CLUSTER HEALTH ASSESSMENT ==="
# 1. Node Health Check
echo "### NODE HEALTH (Critical Weight: 1.0) ###"
kubectl get nodes -o wide | grep -E "NotReady|Unknown" && echo "BOOM: Unhealthy nodes detected!" || echo "✓ All nodes healthy"
# 2. Pod Issues Check
echo -e "\n### POD HEALTH (Critical Weight: 1.0) ###"
POD_ISSUES=$(kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers | wc -l)
if [ $POD_ISSUES -gt 0 ]; then
echo "WARN: $POD_ISSUES pods not running"
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
else
echo "โ All pods running"
fi
# 3. Security Issues Check
echo -e "\n### SECURITY ASSESSMENT (Critical Weight: 1.0) ###"
# Check for privileged containers
PRIVILEGED=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.privileged == true) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $PRIVILEGED -gt 0 ]; then
echo "BOOM: $PRIVILEGED privileged containers detected (Security Risk!)"
else
echo "โ No privileged containers found"
fi
# Check for containers running as root
ROOT_CONTAINERS=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.runAsUser == 0) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $ROOT_CONTAINERS -gt 0 ]; then
echo "WARN: $ROOT_CONTAINERS containers running as root"
else
echo "โ No containers running as root"
fi
# 4. Resource Configuration Check
echo -e "\n### RESOURCE CONFIGURATION (Warning Weight: 0.8) ###"
NO_LIMITS=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].resources.limits == null) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $NO_LIMITS -gt 0 ]; then
echo "WARN: $NO_LIMITS containers without resource limits"
else
echo "โ All containers have resource limits"
fi
# 5. Storage Issues Check
echo -e "\n### STORAGE HEALTH (Warning Weight: 0.5) ###"
PENDING_PVC=$(kubectl get pvc -A --field-selector=status.phase!=Bound --no-headers | wc -l)
if [ $PENDING_PVC -gt 0 ]; then
echo "WARN: $PENDING_PVC PVCs not bound"
kubectl get pvc -A --field-selector=status.phase!=Bound
else
echo "โ All PVCs bound"
fi
# 6. Network Issues Check
echo -e "\n### NETWORKING (Warning Weight: 0.5) ###"
# Check services without endpoints
EMPTY_ENDPOINTS=$(kubectl get endpoints -A -o json | jq -r '.items[] | select(.subsets == null) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $EMPTY_ENDPOINTS -gt 0 ]; then
echo "WARN: $EMPTY_ENDPOINTS services may have no endpoints"
else
echo "✓ Services appear healthy"
fi
# OpenShift specific checks
if command -v oc &> /dev/null; then
echo -e "\n### OPENSHIFT CLUSTER OPERATORS (Critical Weight: 1.0) ###"
DEGRADED=$(oc get clusteroperators --no-headers | grep -vc "True.*False.*False")
if [ $DEGRADED -gt 0 ]; then
echo "BOOM: $DEGRADED cluster operators degraded/unavailable"
oc get clusteroperators | grep -v "True.*False.*False"
else
echo "โ All cluster operators healthy"
fi
fi
Security Vulnerability Detection
Container Security Analysis
# Security Context Validation
echo "=== CONTAINER SECURITY ANALYSIS ==="
# 1. Privileged Containers (Critical)
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}{"\t"}{.securityContext.privileged}{"\n"}{end}{end}' | grep "true" && echo "BOOM: Privileged containers found!" || echo "✓ No privileged containers"
# 2. Host Namespace Access (Critical)
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.hostNetwork}{"\t"}{.spec.hostPID}{"\t"}{.spec.hostIPC}{"\n"}{end}' | grep -w true && echo "BOOM: Host namespace access detected!" || echo "✓ No host namespace access"
# 3. Capabilities Check (Warning)
kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.capabilities.add != null) | "\(.metadata.namespace)/\(.metadata.name): \(.spec.containers[].securityContext.capabilities.add[])"'
# 4. Read-Only Root Filesystem (Warning)
READONLY_ISSUES=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.readOnlyRootFilesystem == false) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
echo "INFO: $READONLY_ISSUES containers without read-only root filesystem"
RBAC Security Analysis
echo "=== RBAC SECURITY ANALYSIS ==="
# Check for overly permissive roles (wildcard verbs)
kubectl get clusterroles -o json | jq -r '.items[] | select(.rules[]?.verbs[]? == "*") | "\(.metadata.name): Wildcard permissions detected"' | sort -u
# List service accounts; review what each one can actually do (see the sketch below)
kubectl get serviceaccounts -A -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"'
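The listing above only enumerates service accounts. To see what a specific account is allowed to do, one option is to query the API directly; a minimal sketch, assuming ${NS} and ${SA} hold the namespace and service account name:
# What is this service account allowed to do?
kubectl auth can-i --list --as=system:serviceaccount:${NS}:${SA}
# Which ClusterRoleBindings reference it?
kubectl get clusterrolebindings -o json | jq -r --arg ns "${NS}" --arg sa "${SA}" \
'.items[] | select(.subjects[]? | .kind == "ServiceAccount" and .name == $sa and .namespace == $ns) | .metadata.name'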
Performance Issues Detection
Resource Utilization Analysis
echo "=== PERFORMANCE ANALYSIS ==="
# Find pods approaching memory limits (assumes Mi units from kubectl top)
kubectl top pods -A --no-headers | while read ns pod cpu mem _; do
mem_val=${mem%Mi}
if [ "${mem_val:-0}" -gt 900 ] 2>/dev/null; then
echo "WARN: Pod ${ns}/${pod} using high memory: ${mem}"
fi
done
# CPU throttling detection (millicores)
kubectl top pods -A --no-headers | while read ns pod cpu mem _; do
cpu_val=${cpu%m}
if [ "${cpu_val:-0}" -gt 900 ] 2>/dev/null; then
echo "WARN: Pod ${ns}/${pod} using high CPU: ${cpu}"
fi
done
Configuration Best Practices Validation
Deployment Health Checks
echo "=== DEPLOYMENT BEST PRACTICES ==="
# Check for liveness/readiness probes
NO_PROBES=$(kubectl get deployments -A -o json | jq -r '.items[] | select(.spec.template.spec.containers[].livenessProbe == null or .spec.template.spec.containers[].readinessProbe == null) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
echo "INFO: $NO_PROBES deployments missing health probes"
# Check for pod disruption budgets
PDB_COUNT=$(kubectl get pdb -A --no-headers | wc -l)
DEPLOY_COUNT=$(kubectl get deployments -A --no-headers | wc -l)
echo "INFO: $PDB_COUNT pod disruption budgets for $DEPLOY_COUNT deployments"
# Rolling update strategy
NO_STRATEGY=$(kubectl get deployments -A -o json | jq -r '.items[] | select(.spec.strategy.type != "RollingUpdate") | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
echo "INFO: $NO_STRATEGY deployments not using RollingUpdate"
Troubleshooting Workflow
- Identify Scope: Pod, Node, Namespace, or Cluster-wide issue?
- Gather Context: Events, logs, resource status, recent changes
- Analyze Symptoms: Match patterns to known issues
- Determine Root Cause: Follow diagnostic tree
- Remediate: Apply fix and verify resolution
- Document: Record findings for future reference
Quick Diagnostic Commands
# Pod status overview
kubectl get pods -n ${NAMESPACE} -o wide
# Recent events (sorted by time)
kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp'
# Pod details and events
kubectl describe pod ${POD_NAME} -n ${NAMESPACE}
# Container logs (current)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER}
# Container logs (previous crashed instance)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER} --previous
# Node status
kubectl get nodes -o wide
kubectl describe node ${NODE_NAME}
# Resource usage
kubectl top pods -n ${NAMESPACE}
kubectl top nodes
# OpenShift specific
oc get events -n ${NAMESPACE}
oc adm top pods -n ${NAMESPACE}
oc get clusteroperators
oc adm node-logs ${NODE_NAME} -u kubelet
Pod Status Interpretation
Pod Phase States
| Phase | Meaning | Action |
|---|---|---|
| Pending | Not scheduled or pulling images | Check events, node resources, PVC status |
| Running | At least one container running | Check container statuses if issues |
| Succeeded | All containers completed successfully | Normal for Jobs |
| Failed | All containers terminated, at least one failed | Check logs, exit codes |
| Unknown | Cannot determine state | Node communication issue |
Container State Analysis
Waiting States
| Reason | Cause | Resolution |
|---|---|---|
| ContainerCreating | Setting up container | Check events for errors, volume mounts |
| ImagePullBackOff | Cannot pull image | Verify image name, registry access, credentials |
| ErrImagePull | Image pull failed | Check image exists, network, ImagePullSecrets |
| CreateContainerConfigError | Config error | Check ConfigMaps, Secrets exist and mounted correctly |
| InvalidImageName | Malformed image reference | Fix image name in spec |
| CrashLoopBackOff | Container repeatedly crashing | Check logs --previous, fix application |
Terminated States
| Reason | Exit Code | Cause | Resolution |
|---|---|---|---|
| OOMKilled | 137 | Memory limit exceeded | Increase memory limit, fix memory leak |
| Error | 1 | Application error | Check logs for stack trace |
| Error | 126 | Command not executable | Fix command/entrypoint permissions |
| Error | 127 | Command not found | Fix command path, verify image contents |
| Error | 128 | Invalid exit code | Application bug |
| Error | 130 | SIGINT (Ctrl+C) | Normal if manual termination |
| Error | 137 | SIGKILL | OOM or forced termination |
| Error | 143 | SIGTERM | Graceful shutdown requested |
| Completed | 0 | Normal exit | Expected for Jobs/init containers |
Event Analysis
Event Types and Severity
Type: Normal → Informational, typically no action needed
Type: Warning → Potential issue, investigate
Critical Events to Monitor
Pod Scheduling Events
| Event Reason | Meaning | Resolution |
|---|---|---|
| FailedScheduling | Cannot place pod | Check node resources, taints, affinity |
| Unschedulable | No suitable node | Add nodes, adjust requirements |
| NodeNotReady | Target node unavailable | Check node status |
| TaintManagerEviction | Pod evicted due to taint | Check node taints, add tolerations |
FailedScheduling Analysis:
# Common messages and fixes:
"Insufficient cpu" โ Reduce requests or add capacity
"Insufficient memory" โ Reduce requests or add capacity
"node(s) had taint" โ Add toleration or remove taint
"node(s) didn't match selector" โ Fix nodeSelector/affinity
"persistentvolumeclaim not found" โ Create PVC or fix name
"0/3 nodes available" โ All nodes have issues, check each
Image Events
| Event Reason | Meaning | Resolution |
|---|---|---|
| Pulling | Downloading image | Normal, wait |
| Pulled | Image downloaded | Normal |
| Failed | Pull failed | Check image name, registry, auth |
| BackOff | Repeated pull failures | Fix underlying issue |
| ErrImageNeverPull | Image not local with Never policy | Change imagePullPolicy or pre-pull |
ImagePullBackOff Diagnosis:
# Check image name is correct
kubectl get pod ${POD} -o jsonpath='{.spec.containers[*].image}'
# Verify ImagePullSecrets
kubectl get pod ${POD} -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret ${SECRET} -n ${NAMESPACE}
# Test registry access
kubectl run test --image=${IMAGE} --restart=Never --rm -it -- /bin/sh
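If the registry requires authentication, a pull secret can be created and wired to the workload; a sketch with placeholder registry and credential values:
# Create a docker-registry pull secret (server, user and password are placeholders)
kubectl create secret docker-registry regcred -n ${NAMESPACE} \
--docker-server=registry.example.com \
--docker-username=${REGISTRY_USER} \
--docker-password=${REGISTRY_PASS}
# Attach it to the default service account (or reference it via spec.imagePullSecrets)
kubectl patch serviceaccount default -n ${NAMESPACE} \
-p '{"imagePullSecrets": [{"name": "regcred"}]}'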
Volume Events
| Event Reason | Meaning | Resolution |
|---|---|---|
| FailedMount | Cannot mount volume | Check PVC, storage class, permissions |
| FailedAttachVolume | Cannot attach volume | Check cloud provider, volume exists |
| VolumeResizeFailed | Cannot expand volume | Check storage class allows expansion |
| ProvisioningFailed | Cannot create volume | Check storage class, quotas |
PVC Pending Diagnosis:
# Check PVC status and events
kubectl describe pvc ${PVC_NAME} -n ${NAMESPACE}
# Verify StorageClass exists and is default
kubectl get storageclass
# Check for available PVs (if not dynamic provisioning)
kubectl get pv
# OpenShift: Check storage operator
oc get clusteroperator storage
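If no StorageClass is marked as default, dynamically provisioned PVCs that omit storageClassName will stay Pending; a sketch for marking one as default (the class name "standard" is a placeholder, pick one from kubectl get storageclass):
# Mark an existing StorageClass as the cluster default
kubectl patch storageclass standard -p \
'{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'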
Container Events
| Event Reason | Meaning | Resolution |
|---|---|---|
| Created | Container created | Normal |
| Started | Container started | Normal |
| Killing | Container being stopped | Normal during updates/scale-down |
| Unhealthy | Probe failed | Fix probe or application |
| ProbeWarning | Probe returned warning | Check probe configuration |
| BackOff | Container crashing | Check logs, fix application |
Event Patterns
Flapping Pod (Repeated restarts)
Events:
Warning BackOff Container is in waiting state due to CrashLoopBackOff
Normal Pulled Container image already present
Normal Created Created container
Normal Started Started container
Warning BackOff Back-off restarting failed container
Diagnosis: Check kubectl logs --previous, application is crashing on startup.
Resource Starvation
Events:
Warning FailedScheduling 0/3 nodes are available: 3 Insufficient cpu
Diagnosis: Cluster needs more capacity or pod requests are too high.
Probe Failures
Events:
Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 503
Normal Killing Container failed liveness probe, will be restarted
Diagnosis: Application not responding, check if startup is slow (use startupProbe) or app is unhealthy.
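For slow-starting applications, a startupProbe defers liveness checking until the app has come up; a minimal sketch (deployment name, path, port and thresholds are illustrative):
# Allow up to 30 * 10s = 300s of startup time before liveness checks apply
kubectl patch deployment my-app -n ${NS} --type=json -p='[
{"op": "add", "path": "/spec/template/spec/containers/0/startupProbe",
"value": {"httpGet": {"path": "/healthz", "port": 8080}, "failureThreshold": 30, "periodSeconds": 10}}
]'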
Log Analysis Patterns
Common Error Patterns
Application Startup Failures
# Java
java.lang.OutOfMemoryError: Java heap space
→ Increase memory limit, tune JVM heap (-Xmx)
java.net.ConnectException: Connection refused
→ Dependency not ready, add init container or retry logic
# Python
ModuleNotFoundError: No module named 'xxx'
→ Missing dependency, fix requirements.txt/Dockerfile
# Node.js
Error: Cannot find module 'xxx'
→ Missing dependency, fix package.json or node_modules
# General
ECONNREFUSED, Connection refused
→ Service dependency not available
ENOTFOUND, getaddrinfo ENOTFOUND
→ DNS resolution failed, check service name
Database Connection Issues
# PostgreSQL
FATAL: password authentication failed
→ Wrong credentials, check Secret values
connection refused
→ Database not running or wrong host/port
too many connections
→ Connection pool exhaustion, configure pool size
# MySQL
Access denied for user
→ Wrong credentials or missing grants
Can't connect to MySQL server
→ Database not running or network issue
# MongoDB
MongoNetworkError
→ Connection string wrong or network issue
Memory Issues
# Container OOMKilled
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
→ Solutions:
1. Increase memory limit
2. Profile application memory usage
3. Fix memory leaks
4. For JVM: Set -Xmx < container limit (leave ~25% headroom)
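As a sketch of the headroom rule (deployment name, env var and sizes are placeholders): with a 1Gi limit, capping the heap at roughly 768Mi leaves about 25% for non-heap memory.
# Raise the container memory limit and size the JVM heap below it
kubectl set resources deployment/my-app -n ${NS} \
--requests=memory=512Mi --limits=memory=1Gi
# Assumes the image reads JAVA_OPTS; adjust to your entrypoint's convention
kubectl set env deployment/my-app -n ${NS} JAVA_OPTS="-Xmx768m"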
Permission Issues
# File system
Permission denied
mkdir: cannot create directory: Permission denied
→ Check securityContext, runAsUser, fsGroup
# OpenShift SCC
Error: container has runAsNonRoot and image has non-numeric user
→ Add runAsUser to securityContext
pods "xxx" is forbidden: unable to validate against any security context constraint
→ Create appropriate SCC or use service account with SCC access
Log Analysis Commands
# Search for errors in logs
kubectl logs ${POD} -n ${NS} | grep -iE "(error|exception|fatal|panic)"
# Follow logs in real-time
kubectl logs -f ${POD} -n ${NS}
# Logs from all containers in pod
kubectl logs ${POD} -n ${NS} --all-containers
# Logs from multiple pods (by label)
kubectl logs -l app=${APP_NAME} -n ${NS} --all-containers
# Logs with timestamps
kubectl logs ${POD} -n ${NS} --timestamps
# Logs from last hour
kubectl logs ${POD} -n ${NS} --since=1h
# Logs from last 100 lines
kubectl logs ${POD} -n ${NS} --tail=100
# OpenShift: Node-level logs
oc adm node-logs ${NODE} -u kubelet
oc adm node-logs ${NODE} -u crio
oc adm node-logs ${NODE} --path=journal
Node Troubleshooting
Node Conditions
| Condition | Status | Meaning |
|---|---|---|
| Ready | True | Node healthy |
| Ready | False | Kubelet not healthy |
| Ready | Unknown | No heartbeat from node |
| MemoryPressure | True | Low memory |
| DiskPressure | True | Low disk space |
| PIDPressure | True | Too many processes |
| NetworkUnavailable | True | Network not configured |
Node NotReady Diagnosis
# Check node status
kubectl describe node ${NODE_NAME}
# Check kubelet status (SSH to node or via oc adm)
systemctl status kubelet
journalctl -u kubelet -f
# Check container runtime
systemctl status crio # or containerd/docker
journalctl -u crio -f
# Check node resources
df -h
free -m
top
# OpenShift: Machine status
oc get machines -n openshift-machine-api
oc describe machine ${MACHINE} -n openshift-machine-api
Node Resource Pressure
# Check resource allocation vs capacity
kubectl describe node ${NODE} | grep -A 10 "Allocated resources"
# Find pods using most resources
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory
# Evict pods from node (drain)
kubectl drain ${NODE} --ignore-daemonsets --delete-emptydir-data
Networking Troubleshooting
DNS Issues
# Test DNS resolution from a debug pod
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup ${SERVICE_NAME}
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup ${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local
# Check CoreDNS/DNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check DNS service
kubectl get svc -n kube-system kube-dns
Service Connectivity
# Verify service exists and has endpoints
kubectl get svc ${SERVICE} -n ${NS}
kubectl get endpoints ${SERVICE} -n ${NS}
# Test service from debug pod
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
curl -v http://${SERVICE}.${NS}.svc.cluster.local:${PORT}
# Check if pods match service selector
kubectl get pods -n ${NS} -l ${SELECTOR} -o wide
Ingress/Route Issues
# Check Ingress status
kubectl describe ingress ${INGRESS} -n ${NS}
# Check Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# OpenShift Route
oc describe route ${ROUTE} -n ${NS}
oc get route ${ROUTE} -n ${NS} -o yaml
# Check router pods
oc get pods -n openshift-ingress
oc logs -n openshift-ingress -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
NetworkPolicy Debugging
# List NetworkPolicies affecting a pod
kubectl get networkpolicy -n ${NS}
# Test connectivity with ephemeral debug container
kubectl debug ${POD} -n ${NS} --image=nicolaka/netshoot -- \
curl -v http://${TARGET_SERVICE}:${PORT}
# Check if traffic is blocked (look for drops)
# On node running the pod:
conntrack -L | grep ${POD_IP}
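If a policy turns out to be blocking legitimate traffic, adding an explicit allow rule is usually better than deleting the policy; a hedged sketch (labels, namespace and port are placeholders):
# Allow traffic from pods labeled app=frontend to pods labeled app=backend on TCP 8080
cat <<'EOF' | kubectl apply -n ${NS} -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
EOF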
OpenShift-Specific Troubleshooting
Comprehensive OpenShift Health Assessment (Popeye-Style)
#!/bin/bash
# OpenShift comprehensive health check
echo "=== OPENSHIFT COMPREHENSIVE HEALTH ASSESSMENT ==="
# 1. Cluster Operators Health (Critical)
echo "### CLUSTER OPERATORS (Critical Weight: 1.0) ###"
oc get clusteroperators
echo ""
DEGRADED_OPERATORS=$(oc get clusteroperators --no-headers | grep -vc "True.*False.*False")
if [ $DEGRADED_OPERATORS -gt 0 ]; then
echo "BOOM: $DEGRADED_OPERATORS cluster operators in degraded state!"
oc get clusteroperators | grep -v "True.*False.*False"
else
echo "โ All cluster operators healthy"
fi
# 2. OpenShift Routes Analysis (Warning)
echo -e "\n### ROUTES ANALYSIS (Warning Weight: 0.5) ###"
ROUTE_ISSUES=$(oc get routes -A -o json | jq -r '.items[] | select(.status.ingress == null) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $ROUTE_ISSUES -gt 0 ]; then
echo "WARN: $ROUTE_ISSUES routes not admitted by any router"
else
echo "✓ All routes admitted"
fi
# Check TLS certificate issues
EXPIRED_CERTS=$(oc get routes -A -o json | jq -r '.items[] | select(.spec.tls != null) | select(.status.ingress[].conditions[]?.type == "Admitted" and .status.ingress[].conditions[]?.status == "False") | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $EXPIRED_CERTS -gt 0 ]; then
echo "WARN: $EXPIRED_CERTS routes with TLS issues"
fi
# 3. BuildConfig Health Analysis (Warning)
echo -e "\n### BUILDCONFIG ANALYSIS (Warning Weight: 0.5) ###"
FAILED_BUILDS=$(oc get builds -A --no-headers 2>/dev/null | grep -c Failed)
echo "INFO: $FAILED_BUILDS failed builds detected"
# Check BuildConfig strategies
BUILDCONFIGS=$(oc get buildconfigs -A --no-headers | wc -l)
echo "INFO: $BUILDCONFIGS build configurations found"
# 4. Security Context Constraints (Critical)
echo -e "\n### SCC ANALYSIS (Critical Weight: 1.0) ###"
SCC_VIOLATIONS=$(oc get events -A --field-selector=reason=FailedCreate --no-headers 2>/dev/null | grep -c "unable to validate against any security context constraint")
if [ $SCC_VIOLATIONS -gt 0 ]; then
echo "BOOM: $SCC_VIOLATIONS SCC violations detected!"
oc get events -A --field-selector=reason=FailedCreate | grep "unable to validate against any security context constraint"
else
echo "โ No SCC violations detected"
fi
# 5. ImageStream Health (Warning)
echo -e "\n### IMAGESTREAM ANALYSIS (Warning Weight: 0.3) ###"
STALE_IMAGES=$(oc get imagestreams -A -o json | jq -r '.items[] | select(.status.tags[]?.items? | length == 0) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
if [ $STALE_IMAGES -gt 0 ]; then
echo "WARN: $STALE_IMAGES ImageStreams without images"
else
echo "โ All ImageStreams have images"
fi
# 6. Project Resource Quotas (Warning)
echo -e "\n### PROJECT QUOTAS (Warning Weight: 0.5) ###"
oc get projects -o json | jq -r '.items[] | "\(.metadata.name): \(.status.phase)"' | while read project_info; do
project=$(echo $project_info | cut -d: -f1)
phase=$(echo $project_info | cut -d: -f2 | tr -d ' ')
if [ "$phase" == "Active" ]; then
echo "✓ Project $project is active"
else
echo "WARN: Project $project is $phase"
fi
done
Cluster Operators
# Check overall cluster health
oc get clusteroperators
# Investigate degraded operator
oc describe clusteroperator ${OPERATOR}
# Check operator logs
oc logs -n openshift-${OPERATOR} -l name=${OPERATOR}-operator
# OpenShift-specific: Check co-reconcile events
oc get events -A --field-selector reason=OperatorStatusChanged
Security Context Constraints (SCC)
# List SCCs
oc get scc
# Check which SCC a pod is using
oc get pod ${POD} -n ${NS} -o yaml | grep scc
# Check which SCC a serviceaccount can use
oc adm policy who-can use scc restricted-v2
# Add SCC to service account (requires admin)
oc adm policy add-scc-to-user ${SCC} -z ${SERVICE_ACCOUNT} -n ${NS}
Common SCC Issues:
Error: pods "xxx" is forbidden: unable to validate against any security context constraint
โ Diagnosis:
1. Check pod securityContext requirements
2. Find compatible SCC: oc get scc
3. Grant SCC to service account or adjust pod spec
Common fixes:
- Add runAsUser to match SCC requirements
- Use less restrictive SCC (not recommended for prod)
- Create custom SCC for specific needs
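When adjusting the pod spec instead of granting a broader SCC, a securityContext along these lines usually satisfies the restricted SCC family. This is a sketch only: the deployment and container names are placeholders, and on OpenShift it is often better to omit runAsUser and let the project's UID range be assigned.
# Strategic merge patch: non-root, no privilege escalation, all capabilities dropped
kubectl patch deployment my-app -n ${NS} -p '{
  "spec": {"template": {"spec": {
    "securityContext": {"runAsNonRoot": true},
    "containers": [{"name": "my-app", "securityContext": {
      "allowPrivilegeEscalation": false,
      "capabilities": {"drop": ["ALL"]},
      "seccompProfile": {"type": "RuntimeDefault"}}}]
  }}}
}'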
Build Failures
# Check build status
oc get builds -n ${NS}
oc describe build ${BUILD} -n ${NS}
# Build logs
oc logs build/${BUILD} -n ${NS}
oc logs -f bc/${BUILDCONFIG} -n ${NS}
# Check builder pod
oc get pods -n ${NS} | grep build
oc describe pod ${BUILD_POD} -n ${NS}
Common Build Issues:
| Error | Cause | Resolution |
|---|---|---|
| error: build error: image not found | Base image missing | Check ImageStream or external registry |
| AssembleInputError | S2I assemble failed | Check application dependencies |
| GenericBuildFailed | Build command failed | Check build logs for details |
| PushImageToRegistryFailed | Cannot push to registry | Check registry access, quotas |
ImageStream Issues
# Check ImageStream
oc get is ${IS_NAME} -n ${NS}
oc describe is ${IS_NAME} -n ${NS}
# Import external image
oc import-image ${IS_NAME}:${TAG} --from=${EXTERNAL_IMAGE} --confirm -n ${NS}
# Check image import status
oc get imagestreamtag ${IS_NAME}:${TAG} -n ${NS}
Performance Analysis
Resource Optimization
# Get actual resource usage vs requests
kubectl top pods -n ${NS}
# Compare with requests/limits
kubectl get pods -n ${NS} -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
CPU_LIM:.spec.containers[*].resources.limits.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory,\
MEM_LIM:.spec.containers[*].resources.limits.memory
# Find pods without resource limits
kubectl get pods -A -o json | jq -r \
'.items[] | select(.spec.containers[].resources.limits == null) |
"\(.metadata.namespace)/\(.metadata.name)"'
Right-Sizing Recommendations
| Symptom | Indication | Action |
|---|---|---|
| CPU throttling, high latency | CPU limit too low | Increase CPU limit |
| OOMKilled frequently | Memory limit too low | Increase memory limit |
| Low CPU utilization | Over-provisioned | Reduce CPU request |
| Low memory utilization | Over-provisioned | Reduce memory request |
| Pending pods | Cluster capacity full | Add nodes or optimize |
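Once a target size is chosen, requests and limits can be adjusted in place and then compared against kubectl top over time; a sketch with placeholder names and values:
# Right-size a deployment (names and values are examples only)
kubectl set resources deployment/my-app -n ${NS} \
--requests=cpu=250m,memory=256Mi \
--limits=cpu=500m,memory=512Mi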
Latency Investigation
# Check pod startup time
kubectl get pod ${POD} -n ${NS} -o jsonpath='{.status.conditions}'
# Check container startup
kubectl get pod ${POD} -n ${NS} -o jsonpath='{.status.containerStatuses[*].state}'
# Slow image pulls
kubectl describe pod ${POD} -n ${NS} | grep -A 5 "Events"
# Network latency test (inline -w format avoids needing a curl-format.txt file)
kubectl run nettest --image=nicolaka/netshoot --rm -it --restart=Never -- \
curl -o /dev/null -s -w "dns: %{time_namelookup}s connect: %{time_connect}s total: %{time_total}s\n" http://${SERVICE}:${PORT}
Popeye-Style Diagnostic Decision Trees
Comprehensive Cluster Health Assessment Tree
Cluster Health Score < 80?
├── Yes → Check Critical Issues (BOOM: -50 points each)
│   ├── Node Health Issues?
│   │   ├── NotReady nodes → kubelet problems, resource pressure
│   │   ├── Unknown nodes → Network connectivity, API server
│   │   └── Resource pressure → CPU/Memory/Disk pressure
│   ├── Security Vulnerabilities?
│   │   ├── Privileged containers → Remove privileged flag
│   │   ├── Host namespace access → Remove hostNetwork/hostPID/hostIPC
│   │   ├── Run as root → Set runAsNonRoot: true, runAsUser > 0
│   │   └── Wildcard RBAC → Create specific roles with minimal permissions
│   ├── Service Failures?
│   │   ├── No endpoints → Fix service selector or pod labels
│   │   ├── Failed load balancers → Check cloud provider quotas
│   │   └── Certificate issues → Renew TLS certificates
│   └── Resource Exhaustion?
│       ├── OOMKilled pods → Increase memory limits
│       ├── CPU throttling → Increase CPU limits
│       └── Storage full → Clean up or expand storage
├── No → Check Warning Issues (WARN: -20 points each)
│   ├── Configuration Issues?
│   │   ├── No resource limits → Add requests/limits
│   │   ├── No health probes → Add liveness/readiness probes
│   │   ├── Missing PDBs → Create PodDisruptionBudgets
│   │   └── No rolling updates → Use RollingUpdate strategy
│   ├── Performance Issues?
│   │   ├── Underutilized resources → Right-size pods
│   │   ├── Large container images → Optimize Dockerfile
│   │   └── Inefficient scheduling → Add affinity/anti-affinity
│   └── Reliability Issues?
│       ├── Single replicas → Increase replica count
│       ├── No backup strategy → Implement backup solution
│       └── Missing monitoring → Add metrics and logging
└── Score >= 80 → Check Info Issues (INFO: -5 points each)
    ├── Best Practice Violations?
    │   ├── Missing labels → Add standard labels
    │   ├── No termination grace → Set terminationGracePeriodSeconds
    │   └── Deprecated APIs → Update to newer API versions
    └── Optimization Opportunities?
        ├── Unused resources → Clean up orphaned resources
        ├── ImagePullPolicy: Always → Use IfNotPresent for production
        └── Large logs → Implement log rotation
Pod Not Starting - Enhanced Diagnostic Tree
Pod Phase = Pending?
├── Yes → Check Scheduling Issues
│   ├── Events: FailedScheduling?
│   │   ├── "Insufficient cpu/memory" →
│   │   │   ├── Add nodes OR
│   │   │   ├── Reduce resource requests OR
│   │   │   └── Enable cluster autoscaler
│   │   ├── "node(s) had taint" →
│   │   │   ├── Add toleration to pod OR
│   │   │   └── Remove taint from node
│   │   ├── "node(s) didn't match nodeSelector" →
│   │   │   ├── Fix nodeSelector labels OR
│   │   │   └── Update node labels
│   │   ├── "persistentvolumeclaim not found" →
│   │   │   ├── Create PVC with correct name OR
│   │   │   └── Fix PVC reference in pod
│   │   └── "0/X nodes available" → Check all nodes for issues
│   └── No FailedScheduling events?
│       ├── Check ResourceQuota → Quota exceeded?
│       ├── Check LimitRange → Requests too small/large?
│       └── Check Namespace → Namespace exists and not terminating?
└── No → Pod Phase = Running with issues?
    ├── ContainerCreating > 5min?
    │   ├── Events: ImagePullBackOff?
    │   │   ├── Check image name/registry → Fix image reference
    │   │   ├── Check ImagePullSecrets → Create/update secrets
    │   │   └── Test registry access → kubectl run test-pod --image=xxx
    │   ├── Events: FailedMount?
    │   │   ├── PVC not bound → Create PV or fix StorageClass
    │   │   ├── Secret/ConfigMap not found → Create missing resources
    │   │   └── Permission denied → Fix securityContext, fsGroup
    │   └── Events: CreateContainerConfigError?
    │       ├── Missing ConfigMap → Create ConfigMap
    │       ├── Invalid volume mount → Fix volumeMount path
    │       └── Security context violation → Adjust SCC or securityContext
    └── Container status: Waiting/CrashLoopBackOff?
        ├── Exit code analysis:
        │   ├── 137 (OOMKilled) → Increase memory limit
        │   ├── 1 (General error) → Check application logs
        │   ├── 125/126/127 (Command issues) → Fix entrypoint/command
        │   └── 143 (SIGTERM) → Graceful shutdown issue
        └── No previous logs?
            ├── Application starts too slowly → Add startupProbe
            ├── Entrypoint command not found → Fix Dockerfile CMD/ENTRYPOINT
            └── Permission denied → Fix file permissions in image
Security Issues Diagnostic Tree
Security Issues Detected?
├── Privileged Containers (Critical)?
│   ├── Find: kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].securityContext.privileged == true)'
│   ├── Why: Dangerous escape from container isolation
│   └── Fix: Set privileged: false or use least privileged SCC
├── Host Namespace Access (Critical)?
│   ├── Check: hostNetwork, hostPID, hostIPC = true
│   ├── Why: Access to host system resources
│   └── Fix: Remove host namespace access, use specific alternatives
├── Root User Execution (Warning)?
│   ├── Check: runAsUser = 0 or no runAsNonRoot
│   ├── Why: Root access in container
│   └── Fix: Set runAsNonRoot: true, runAsUser: 1000+
├── Wildcard RBAC Permissions (Critical)?
│   ├── Check: verbs: ["*"] or resources: ["*"]
│   ├── Why: Over-privileged service accounts
│   └── Fix: Create specific roles with minimal permissions
├── Missing Security Context (Warning)?
│   ├── Check: No securityContext at pod or container level
│   ├── Why: Default settings may not be secure enough
│   └── Fix: Add securityContext with appropriate settings
└── Sensitive Data in Environment Variables (Critical)?
    ├── Check: Passwords, tokens, keys in env
    ├── Why: Visible via kubectl describe, logs
    └── Fix: Use Secrets, consider external secret management
Performance Issues Diagnostic Tree
Performance Issues Detected?
├── Resource Utilization Issues?
│   ├── High CPU Usage?
│   │   ├── Symptoms: High latency, throttling
│   │   ├── Diagnose: kubectl top pods, kubectl describe node
│   │   └── Solutions: Increase limits, optimize code, add replicas
│   ├── Memory Pressure?
│   │   ├── Symptoms: OOMKilled, swapping, slow performance
│   │   ├── Diagnose: kubectl top pods, check events for OOM
│   │   └── Solutions: Increase limits, fix memory leaks, add nodes
│   └── Storage Issues?
│       ├── Symptoms: Failed writes, slow I/O, PVC pending
│       ├── Diagnose: kubectl get pv/pvc, df -h on nodes
│       └── Solutions: Expand PVs, add storage, optimize I/O patterns
├── Networking Performance?
│   ├── DNS Resolution Delays?
│   │   ├── Check: CoreDNS pods, node DNS config
│   │   ├── Test: nslookup from debug pod
│   │   └── Fix: Scale CoreDNS, optimize DNS config
│   ├── Service Connectivity Issues?
│   │   ├── Check: Service endpoints, NetworkPolicies
│   │   ├── Test: curl to service.cluster.local
│   │   └── Fix: Fix selectors, adjust NetworkPolicies
│   └── Ingress/Route Performance?
│       ├── Check: Ingress controller resources, TLS config
│       ├── Test: Load test with hey/wrk
│       └── Fix: Scale ingress, optimize TLS, add caching
└── Application-Specific Issues?
    ├── Slow Startup Times?
    │   ├── Check: Image size, initialization steps
    │   ├── Fix: Multi-stage builds, optimize startup
    │   └── Configure: startupProbe with appropriate values
    ├── Database Connection Pool Issues?
    │   ├── Check: Connection limits, timeout settings
    │   ├── Monitor: Active connections, wait time
    │   └── Fix: Adjust pool size, add connection retry logic
    └── Cache Inefficiency?
        ├── Check: Hit ratios, cache size
        ├── Monitor: Memory usage, eviction rates
        └── Fix: Optimize cache strategy, add external cache
OpenShift-Specific Issues Tree
OpenShift Issues Detected?
├── Cluster Operator Degraded?
│   ├── Check: oc get clusteroperators
│   ├── Investigate: oc describe clusteroperator <name>
│   ├── Logs: oc logs -n openshift-<operator>
│   └── Common fixes:
│       ├── authentication/oauth → Check cert rotation
│       ├── ingress → Check router pods, certificates
│       ├── storage → Check storage class, provisioner
│       └── network → Check CNI configuration
├── SCC Violations?
│   ├── Check: Events for "unable to validate against any security context constraint"
│   ├── Diagnose: oc get scc, oc adm policy who-can use scc
│   └── Fix:
│       ├── Grant appropriate SCC to service account
│       ├── Adjust pod securityContext to match SCC
│       └── Create custom SCC for specific needs
├── BuildConfig Failures?
│   ├── Check: oc get builds, oc logs build/<build-name>
│   ├── Common issues:
│   │   ├── Source code access → Git credentials, webhook
│   │   ├── Base image not found → ImageStream, registry
│   │   ├── Build timeouts → Increase timeout, optimize build
│   │   └── Registry push failures → Permissions, quotas
│   └── Fix: Address specific build error, retry build
├── Route Issues?
│   ├── Check: oc get routes, oc describe route <name>
│   ├── Common issues:
│   │   ├── No endpoints → Service selector, pod health
│   │   ├── TLS certificate expired → Renew cert
│   │   ├── Wrong host/path → Update route spec
│   │   └── Router not responding → Check router pods
│   └── Fix: Fix underlying service or update route config
└── ImageStream Issues?
    ├── Check: oc get imagestreams, oc describe is <name>
    ├── Common issues:
    │   ├── No tags/images → Trigger import, fix image reference
    │   ├── Import failures → Registry access, credentials
    │   └── Tag not found → Fix tag reference, re-tag image
    └── Fix: Re-import image, fix registry connection
Application Not Reachable - Enhanced Tree
Application Connectivity Issue?
├── Service Level Issues?
│   ├── Service has no endpoints?
│   │   ├── Check: kubectl get endpoints <service>
│   │   ├── Verify: Service selector matches pod labels
│   │   ├── Check: Pod health and readiness
│   │   └── Fix: Update selector or fix pod issues
│   ├── Service wrong type?
│   │   ├── ClusterIP but expecting external → Use LoadBalancer/NodePort
│   │   ├── LoadBalancer not getting IP → Check cloud provider
│   │   └── NodePort not accessible → Check firewall, node ports
│   └── Service port wrong?
│       ├── Check: targetPort vs containerPort
│       ├── Verify: Protocol (TCP/UDP) matches
│       └── Fix: Update service port configuration
├── Ingress/Route Issues?
│   ├── Ingress not found or misconfigured?
│   │   ├── Check: kubectl get ingress, describe ingress
│   │   ├── Verify: Host, path, backend service
│   │   └── Fix: Update ingress configuration
│   ├── TLS Certificate Issues?
│   │   ├── Check: Certificate expiration, validity
│   │   ├── Verify: Secret exists and contains cert/key
│   │   └── Fix: Renew certificate, update secret
│   ├── Ingress Controller Issues?
│   │   ├── Check: Controller pod health
│   │   ├── Verify: Controller service endpoints
│   │   └── Fix: Restart controller, fix configuration
│   └── Route-specific (OpenShift)?
│       ├── Check: oc get routes, describe route
│       ├── Verify: Router health, certificates
│       └── Fix: Update route, check router pods
├── NetworkPolicy Blocking?
│   ├── Check: kubectl get networkpolicy
│   ├── Verify: Policy allows traffic flow
│   ├── Test: Temporarily disable policy for debugging
│   └── Fix: Add appropriate ingress/egress rules
└── Application Level Issues?
    ├── Application not binding to right port?
    │   ├── Check: Listen address (0.0.0.0 vs 127.0.0.1)
    │   ├── Verify: Port number matches containerPort
    │   └── Fix: Update application bind configuration
    ├── Health check failures?
    │   ├── Check: Liveness/readiness probe paths
    │   ├── Verify: Application responds to probes
    │   └── Fix: Update probe configuration or application
    └── Application errors?
        ├── Check: Application logs for errors
        ├── Verify: Database connections, dependencies
        └── Fix: Address application-specific issues
Health Check Scripts
Cluster Health Summary
#!/bin/bash
# cluster-health.sh - Quick cluster health check
echo "=== Node Status ==="
kubectl get nodes -o wide
echo -e "\n=== Pods Not Running ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
echo -e "\n=== Recent Warning Events ==="
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20
echo -e "\n=== Resource Pressure ==="
kubectl top nodes
echo -e "\n=== PVCs Not Bound ==="
kubectl get pvc -A --field-selector=status.phase!=Bound
# OpenShift specific
if command -v oc &> /dev/null; then
echo -e "\n=== Cluster Operators ==="
oc get clusteroperators | grep -v "True.*False.*False"
fi
Namespace Health Check
#!/bin/bash
# namespace-health.sh ${NAMESPACE}
NS=${1:-default}
echo "=== Pods in $NS ==="
kubectl get pods -n $NS -o wide
echo -e "\n=== Recent Events ==="
kubectl get events -n $NS --sort-by='.lastTimestamp' | tail -15
echo -e "\n=== Resource Usage ==="
kubectl top pods -n $NS 2>/dev/null || echo "Metrics not available"
echo -e "\n=== Services ==="
kubectl get svc -n $NS
echo -e "\n=== Deployments ==="
kubectl get deploy -n $NS
echo -e "\n=== PVCs ==="
kubectl get pvc -n $NS
Quick Reference: Exit Codes
| Code | Signal | Meaning |
|---|---|---|
| 0 | - | Success |
| 1 | - | General error |
| 2 | - | Misuse of command |
| 126 | - | Command not executable |
| 127 | - | Command not found |
| 128 | - | Invalid exit argument |
| 130 | SIGINT | Keyboard interrupt |
| 137 | SIGKILL | Kill signal (OOM or forced) |
| 143 | SIGTERM | Termination signal |
| 255 | - | Exit status out of range |
Repository
https://github.com/kcns008/cluster-code