cluster-admin
Master Kubernetes cluster administration, from initial setup through production management. Learn cluster installation, scaling, upgrades, and HA strategies.
$ Install
git clone https://github.com/majiayu000/claude-skill-registry /tmp/claude-skill-registry && cp -r /tmp/claude-skill-registry/skills/devops/cluster-admin ~/.claude/skills/claude-skill-registry/
Tip: Run this command in your terminal to install the skill
name: cluster-admin
description: Master Kubernetes cluster administration, from initial setup through production management. Learn cluster installation, scaling, upgrades, and HA strategies.
sasmp_version: "1.3.0"
eqhm_enabled: true
bonded_agent: 01-cluster-admin
bond_type: PRIMARY_BOND
capabilities: ["Cluster lifecycle management", "Node administration", "HA configuration", "Cluster upgrades", "etcd management", "Resource quotas", "Namespace management", "Cluster autoscaling"]
input_schema:
  type: object
  properties:
    action:
      type: string
      enum: ["create", "upgrade", "scale", "backup", "restore", "diagnose"]
    cluster_type:
      type: string
      enum: ["kind", "minikube", "kubeadm", "eks", "aks", "gke"]
    target:
      type: string
output_schema:
  type: object
  properties:
    status:
      type: string
    commands:
      type: array
    recommendations:
      type: array
Cluster Administration
Executive Summary
Production-grade Kubernetes cluster administration covering the complete lifecycle from initial deployment to day-2 operations. This skill provides deep expertise in cluster architecture, high availability configurations, upgrade strategies, and operational best practices aligned with CKA/CKS certification standards.
Core Competencies
1. Cluster Architecture Mastery
Control Plane Components
┌─────────────────────────────────────────────────────────────────┐
│                          CONTROL PLANE                          │
├─────────────┬─────────────┬──────────────┬──────────────────────┤
│ API Server  │ Scheduler   │ Controller   │ etcd                 │
│             │             │ Manager      │                      │
│ - AuthN     │ - Pod       │ - ReplicaSet │ - Cluster state      │
│ - AuthZ     │   placement │ - Endpoints  │ - 3+ nodes for HA    │
│ - Admission │ - Node      │ - Namespace  │ - Regular backups    │
│   control   │   affinity  │ - ServiceAcc │ - Encryption         │
└─────────────┴─────────────┴──────────────┴──────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                           WORKER NODES                           │
├──────────────────┬──────────────────┬────────────────────────────┤
│ kubelet          │ kube-proxy       │ Container Runtime          │
│ - Pod lifecycle  │ - iptables/ipvs  │ - containerd (recommended) │
│ - Node status    │ - Service VIPs   │ - CRI-O                    │
│ - Volume mount   │ - Load balance   │ - gVisor (sandboxed)       │
└──────────────────┴──────────────────┴────────────────────────────┘
Production Cluster Bootstrap (kubeadm)
# Initialize control plane with HA
sudo kubeadm init \
--control-plane-endpoint "k8s-api.example.com:6443" \
--upload-certs \
--pod-network-cidr=10.244.0.0/16 \
--service-cidr=10.96.0.0/12 \
--apiserver-advertise-address=<node-ip> \
--apiserver-cert-extra-sans=k8s-api.example.com
# Join additional control plane nodes
kubeadm join k8s-api.example.com:6443 \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash> \
--control-plane \
--certificate-key <cert-key>
# Join worker nodes
kubeadm join k8s-api.example.com:6443 \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash>
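Bootstrap tokens expire (24 hours by default), so the join commands above may need fresh credentials. A minimal sketch of regenerating them on an existing control plane node:
# Print a fresh worker join command with a new token
kubeadm token create --print-join-command
# Re-upload control plane certificates and print the new certificate key for --control-plane joins
sudo kubeadm init phase upload-certs --upload-certs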
2. Node Management
Node Lifecycle Operations
# View node details with resource usage
kubectl get nodes -o wide
kubectl top nodes
# Label nodes for workload placement
kubectl label nodes worker-01 node-type=compute tier=production
kubectl label nodes worker-02 node-type=gpu accelerator=nvidia-a100
# Taint nodes for dedicated workloads
kubectl taint nodes worker-gpu dedicated=gpu:NoSchedule
# Cordon node (prevent new pods)
kubectl cordon worker-03
# Drain node safely (for maintenance)
kubectl drain worker-03 \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=300 \
--timeout=600s
# Return node to service
kubectl uncordon worker-03
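Labels and taints added above can be reversed with the trailing-dash syntax, and a node's state inspected before returning it to service; for example:
# Remove a label and a taint (note the trailing "-")
kubectl label nodes worker-01 tier-
kubectl taint nodes worker-gpu dedicated=gpu:NoSchedule-
# Inspect taints and schedulability
kubectl describe node worker-03 | grep -A3 Taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable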
Node Problem Detector Configuration
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.14
        securityContext:
          privileged: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: log
          mountPath: /var/log
          readOnly: true
        - name: kmsg
          mountPath: /dev/kmsg
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log
      - name: kmsg
        hostPath:
          path: /dev/kmsg
      tolerations:
      - operator: Exists
        effect: NoSchedule
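Once the DaemonSet is rolled out, a quick sanity check is to confirm the pods are running and that the detector is publishing node conditions and events (a sketch; condition and event reason names depend on the monitors enabled in its config):
# Detector running on every node
kubectl -n kube-system rollout status daemonset/node-problem-detector
kubectl -n kube-system get pods -l app=node-problem-detector -o wide
# Node conditions and events surfaced by the detector
kubectl get node worker-01 -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'
kubectl get events -A --field-selector reason=KernelOops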
3. High Availability Configuration
HA Architecture Pattern
                    ┌─────────────────┐
                    │  Load Balancer  │
                    │  (HAProxy/NLB)  │
                    └────────┬────────┘
                             │
          ┌──────────────────┼──────────────────┐
          │                  │                  │
          ▼                  ▼                  ▼
  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
  │ Control Plane │  │ Control Plane │  │ Control Plane │
  │    Node 1     │  │    Node 2     │  │    Node 3     │
  ├───────────────┤  ├───────────────┤  ├───────────────┤
  │  API Server   │  │  API Server   │  │  API Server   │
  │  Scheduler    │  │  Scheduler    │  │  Scheduler    │
  │  Controller   │  │  Controller   │  │  Controller   │
  │  etcd  ◄──────┼──┼──►  etcd  ◄───┼──┼──►  etcd      │
  └───────────────┘  └───────────────┘  └───────────────┘
          │                  │                  │
          └──────────────────┴──────────────────┘
                             │
                    ┌────────┴────────┐
                    │  Worker Nodes   │
                    │  (N instances)  │
                    └─────────────────┘
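Assuming the load balancer fronts kube-apiserver on k8s-api.example.com:6443 as in the diagram, a few checks confirm the HA topology is healthy (sketch):
# API endpoint health through the load balancer (anonymous access to /healthz is allowed by default RBAC)
curl -k https://k8s-api.example.com:6443/healthz
# Leader election: exactly one active scheduler and controller-manager leader
kubectl -n kube-system get lease kube-scheduler kube-controller-manager
# etcd membership from any control plane node
ETCDCTL_API=3 etcdctl member list --write-out=table \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key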
etcd Backup & Restore
# Backup etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify backup
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-*.db --write-out=table
# Restore etcd (disaster recovery)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-*.db \
--data-dir=/var/lib/etcd-restored \
--name=etcd-0 \
--initial-cluster=etcd-0=https://10.0.0.10:2380 \
--initial-advertise-peer-urls=https://10.0.0.10:2380
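After a restore, the etcd static pod must be pointed at the restored data directory. A minimal sketch for a kubeadm node, assuming the default /etc/kubernetes/manifests/etcd.yaml layout:
# Point etcd at the restored data directory (updates both --data-dir and the hostPath volume)
sudo sed -i 's|/var/lib/etcd|/var/lib/etcd-restored|g' /etc/kubernetes/manifests/etcd.yaml
# kubelet recreates the etcd static pod automatically; confirm it comes back
sudo crictl ps | grep etcd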
# Automated backup CronJob
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true   # reach etcd on the node's loopback (127.0.0.1:2379)
          containers:
          - name: backup
            image: bitnami/etcd:3.5
            command:
            - /bin/sh
            - -c
            - |
              etcdctl snapshot save /backup/etcd-\$(date +%Y%m%d-%H%M).db
            env:
            - name: ETCDCTL_API
              value: "3"
            - name: ETCDCTL_ENDPOINTS
              value: "https://127.0.0.1:2379"
            - name: ETCDCTL_CACERT
              value: /etc/kubernetes/pki/etcd/ca.crt
            - name: ETCDCTL_CERT
              value: /etc/kubernetes/pki/etcd/server.crt
            - name: ETCDCTL_KEY
              value: /etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: backup
              mountPath: /backup
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: etcd-backup-pvc
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
EOF
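Rather than waiting six hours for the first scheduled run, the CronJob can be exercised immediately:
# Trigger a one-off backup from the CronJob and inspect its output
kubectl -n kube-system create job etcd-backup-manual --from=cronjob/etcd-backup
kubectl -n kube-system wait --for=condition=complete job/etcd-backup-manual --timeout=120s
kubectl -n kube-system logs job/etcd-backup-manual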
4. Cluster Upgrades
Upgrade Strategy Decision Tree
Upgrade Required?
│
├── Minor Version (1.29 → 1.30)
│   ├── Review release notes for breaking changes
│   ├── Test in staging environment
│   ├── Upgrade control plane first
│   │   └── One node at a time
│   └── Upgrade workers (rolling)
│
├── Patch Version (1.30.0 → 1.30.1)
│   ├── Generally safe, security fixes
│   └── Can upgrade more aggressively
│
└── Major changes in components
    ├── Test thoroughly
    ├── Have rollback plan
    └── Consider blue-green cluster
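Before choosing a path, confirm the current control plane and kubelet versions so the upgrade stays within the supported version skew (kubelets must never be newer than the API server):
# Server version and per-node kubelet/runtime versions
kubectl version
kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion,RUNTIME:.status.nodeInfo.containerRuntimeVersion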
Production Upgrade Process
# Step 1: Upgrade kubeadm on control plane
sudo apt-mark unhold kubeadm
sudo apt-get update && sudo apt-get install -y kubeadm=1.30.0-1.1  # version suffix as published by the pkgs.k8s.io community repo
sudo apt-mark hold kubeadm
# Step 2: Plan the upgrade
sudo kubeadm upgrade plan
# Step 3: Apply upgrade on first control plane
sudo kubeadm upgrade apply v1.30.0
# Step 4: Upgrade kubelet and kubectl
kubectl drain control-plane-1 --ignore-daemonsets
sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y kubelet=1.30.0-1.1 kubectl=1.30.0-1.1
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet
kubectl uncordon control-plane-1
# Step 5: Upgrade additional control planes
sudo kubeadm upgrade node
# Then upgrade kubelet/kubectl as above
# Step 6: Upgrade worker nodes (rolling)
# Note: the node-role.kubernetes.io/worker label is not set by default; label workers first or list them explicitly
for node in $(kubectl get nodes -l node-role.kubernetes.io/worker -o name); do
kubectl drain $node --ignore-daemonsets --delete-emptydir-data
# SSH to node and upgrade packages
kubectl uncordon $node
sleep 60 # Allow pods to stabilize
done
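Once the loop finishes, verify that every node reports the target version and that workloads have settled:
# All nodes should be Ready on the new kubelet version
kubectl get nodes
# Anything not Running or Succeeded deserves a look
kubectl get pods -A --field-selector status.phase!=Running,status.phase!=Succeeded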
5. Resource Management
Namespace Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    persistentvolumeclaims: "10"
    requests.storage: 500Gi
    pods: "50"
    services: "20"
    secrets: "50"
    configmaps: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-backend
spec:
  limits:
  - type: Container
    default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    min:
      cpu: 50m
      memory: 64Mi
    max:
      cpu: 4
      memory: 8Gi
  - type: PersistentVolumeClaim
    min:
      storage: 1Gi
    max:
      storage: 100Gi
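After applying the manifests, quota consumption and the defaults injected by the LimitRange can be inspected per namespace:
# Current usage vs. hard limits
kubectl -n team-backend describe resourcequota team-quota
# Defaults applied to containers that omit requests/limits
kubectl -n team-backend describe limitrange default-limits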
Cluster Autoscaler Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --scale-down-utilization-threshold=0.5
        resources:
          limits:
            cpu: 100m
            memory: 600Mi
          requests:
            cpu: 100m
            memory: 600Mi
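The autoscaler publishes its view of each node group in a status ConfigMap and logs every scaling decision; a quick health check against the manifest above (names assume the default status ConfigMap):
# Recent scale-up/scale-down decisions
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=50 | grep -Ei 'scale[ -](up|down)'
# Per-node-group status published by the autoscaler
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml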
Integration Patterns
Uses skill: docker-containers
- Container runtime configuration
- Image management on nodes
- Registry authentication
Coordinates with skill: security
- RBAC for cluster admins
- Node security hardening
- Audit logging configuration
Works with skill: monitoring
- Cluster health dashboards
- Control plane metrics
- Node resource alerting
Troubleshooting Guide
Decision Tree: Cluster Health Issues
Cluster Health Problem?
│
├── API Server unreachable
│   ├── Check: API server static pod via crictl ps (kubeadm) or systemctl status kube-apiserver (systemd installs)
│   ├── Check: API server logs (/var/log/pods or /var/log/kube-apiserver.log)
│   ├── Verify: etcd connectivity
│   └── Verify: certificates not expired
│
├── Node NotReady
│   ├── Check: kubelet status on node
│   ├── Check: container runtime status
│   ├── Verify: node network connectivity
│   └── Check: disk pressure, memory pressure
│
├── Pods Pending (no scheduling)
│   ├── Check: kubectl describe pod
│   ├── Verify: node resources available
│   ├── Check: taints and tolerations
│   └── Verify: PVC bound (if using volumes)
│
└── etcd Issues
    ├── Check: etcdctl endpoint health
    ├── Check: etcd member list
    ├── Verify: disk I/O performance
    └── Check: cluster quorum
Debug Commands Cheatsheet
# Cluster-wide diagnostics
kubectl cluster-info dump --output-directory=/tmp/cluster-dump
kubectl get componentstatuses  # deprecated since v1.19; prefer checking kube-system pods directly
kubectl get nodes -o wide
kubectl get events --sort-by='.lastTimestamp' -A
# Control plane health
kubectl get pods -n kube-system
kubectl logs -n kube-system kube-apiserver-<node>
kubectl logs -n kube-system kube-scheduler-<node>
kubectl logs -n kube-system kube-controller-manager-<node>
# etcd health
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Node diagnostics
kubectl describe node <node-name>
kubectl get node <node-name> -o yaml | grep -A 10 conditions
ssh <node> "journalctl -u kubelet --since '1 hour ago'"
# Certificate expiration check
kubeadm certs check-expiration
# Resource usage
kubectl top nodes
kubectl top pods -A --sort-by=memory
Common Challenges & Solutions
| Challenge | Solution |
|---|---|
| etcd performance degradation | Use SSD storage, tune compaction |
| Certificate expiration | Set up cert-manager, kubeadm renew |
| Node resource exhaustion | Configure eviction thresholds, resource quotas |
| Control plane overload | Add more control plane nodes, tune rate limits |
| Upgrade failures | Always backup etcd, use staged rollouts |
| kubelet not starting | Check containerd socket, certificates |
| API server latency | Enable priority/fairness, scale API servers |
| Cluster state drift | GitOps, regular audits, policy enforcement |
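For the certificate-expiration row, kubeadm can renew all of the certificates it manages in place; a sketch of the manual path:
# Check remaining validity, then renew everything kubeadm manages
sudo kubeadm certs check-expiration
sudo kubeadm certs renew all
# Restart the control plane static pods afterwards so they pick up the renewed certs
# (e.g. temporarily move the manifests out of /etc/kubernetes/manifests and back)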
Success Criteria
| Metric | Target |
|---|---|
| Cluster uptime | 99.9% |
| API server latency p99 | <200ms |
| etcd backup success | 100% |
| Node ready status | 100% |
| Upgrade success rate | 100% |
| Certificate validity | >30 days |
| Control plane pods healthy | 100% |
Resources
Repository: https://github.com/majiayu000/claude-skill-registry