Kubernetes Operations
Kubernetes is the de facto container orchestration platform. This guide covers cluster operations, deployment patterns, and Helm chart management based on production experience running EKS clusters serving millions of users across Samsung services.
Kubernetes Architecture
Control Plane
  API Server (kube-apiserver)       entry point for kubectl and the REST API
  etcd (data store)                 holds all cluster state
  Scheduler (kube-scheduler)        assigns pods to nodes
  Controller Manager (kube-cm)      manages replicas, services
        |
        |  (via API Server)
        v
Worker Node 1 | Worker Node 2 | Worker Node 3
  Each worker node runs:
    kubelet (agent)
    kube-proxy (network)
    Container Runtime (containerd)
    [Pod] [Pod]                     each pod holds one or more containers (c)
Control Plane Components
| Component | Function | Failure Impact |
| API Server (kube-apiserver) | Exposes Kubernetes API; front end for all cluster operations | Cluster unmanageable; existing workloads unaffected |
| etcd | Distributed key-value store for all cluster state | Control plane outage (cluster state cannot be read or written); data loss if not backed up |
| Scheduler (kube-scheduler) | Assigns pods to nodes based on resources and constraints | New pods not scheduled; existing pods run |
| Controller Manager (kube-controller-manager) | Runs controllers (replication, endpoints, service account) | Self-healing stops; auto-scaling fails |
| Cloud Controller Manager | Integrates with cloud provider (AWS, Azure, GCP) | Load balancers, volumes not provisioned |
Worker Node Components
| Component | Function |
| kubelet | Agent that ensures containers run as specified in PodSpec |
| kube-proxy | Maintains network rules and connection forwarding |
| Container Runtime | Executes containers (containerd, CRI-O) |
Key Resources
| Resource | Purpose | Key Fields |
| Pod | Smallest deployable unit; contains one or more containers | containers, volumes, restartPolicy |
| Deployment | Manages Pod replicas; supports rolling updates | replicas, strategy, selector, template |
| StatefulSet | Manages stateful apps with stable network identity and storage | serviceName, volumeClaimTemplates, podManagementPolicy |
| DaemonSet | Ensures one pod per node (logging, monitoring agents) | nodeSelector, tolerations |
| Service | Exposes pods via stable IP/DNS (ClusterIP, NodePort, LoadBalancer) | selector, ports, type |
| Ingress | HTTP/HTTPS routing rules to services | rules, tls, annotations (ingress class) |
| ConfigMap | Non-sensitive configuration data | data, binaryData |
| Secret | Sensitive data (passwords, tokens, keys) | type (Opaque, kubernetes.io/tls, kubernetes.io/dockerconfigjson), data (base64-encoded) |
| PersistentVolume | Storage resource in the cluster | capacity, accessModes, storageClassName |
| PersistentVolumeClaim | Request for storage by a pod | resources.requests.storage, storageClassName |
| ServiceAccount | Identity for pods accessing the API | automountServiceAccountToken |
| Role / ClusterRole | Permission definitions (RBAC) | rules (apiGroups, resources, verbs) |
| RoleBinding / ClusterRoleBinding | Assigns roles to subjects | subjects, roleRef |
| NetworkPolicy | Pod-level firewall rules | podSelector, policyTypes, ingress, egress |
| HorizontalPodAutoscaler | Auto-scales pods based on CPU/memory/custom metrics | scaleTargetRef, minReplicas, maxReplicas, metrics |
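NetworkPolicy is the one resource in the table above whose fields are hard to picture without a manifest. The following is a minimal sketch, assuming hypothetical labels (app: backend, app: frontend) and a hypothetical namespace called demo; it admits traffic to the backend pods only from frontend pods in the same namespace, on TCP 8080.

```yaml
# Sketch only: the namespace and the app labels are assumptions for illustration.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: demo
spec:
  # Pods this policy applies to
  podSelector:
    matchLabels:
      app: backend
  # Only ingress is restricted; egress stays unrestricted
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Allowed sources: frontend pods in the same namespace
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Once a pod is selected by any policy that lists Ingress, all other inbound traffic to it is dropped. The policy only takes effect if the cluster's network plugin enforces NetworkPolicy.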
kubectl Command Reference
| Command | Description |
| kubectl get pods -n <ns> | List pods in a namespace |
| kubectl get pods -o wide | List pods with node and IP details |
| kubectl get all -n <ns> | List all resources in namespace |
| kubectl describe pod <name> | Detailed pod information and events |
| kubectl logs <pod> -c <container> | View container logs |
| kubectl logs <pod> --tail=100 -f | Follow last 100 log lines |
| kubectl logs <pod> --previous | Logs from previous container instance (crash debugging) |
| kubectl exec -it <pod> -- /bin/sh | Interactive shell in running pod |
| kubectl apply -f manifest.yaml | Apply configuration from file |
| kubectl apply -f directory/ | Apply all manifests in directory |
| kubectl delete -f manifest.yaml | Delete resources from manifest |
| kubectl port-forward pod/<name> 8080:80 | Forward local port to pod |
| kubectl top pod -n <ns> | Show resource usage (requires metrics-server) |
| kubectl get events --sort-by=.lastTimestamp | View cluster events sorted by time |
| kubectl rollout status deployment/<name> | Check rollout progress |
| kubectl rollout history deployment/<name> | View rollout history |
| kubectl rollout undo deployment/<name> | Rollback to previous revision |
| kubectl get nodes -o custom-columns=... | Custom output formatting |
| kubectl config use-context <context> | Switch kubeconfig context |
| kubectl get ns | List namespaces |
| kubectl create ns <name> | Create namespace |
| kubectl get svc -n <ns> | List services |
| kubectl get ingress -n <ns> | List ingresses |
| kubectl get pv,pvc -n <ns> | List persistent volumes and claims |
| kubectl auth can-i <verb> <resource> | Check RBAC permissions |
| kubectl explain pod.spec | Get API documentation for resource field |
Complete Deployment + Service + Ingress YAML Example
# manifests/web-app.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
  labels:
    app: payment-service
    version: v1.2.3
    tier: backend
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
        version: v1.2.3
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: payment-service-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
        - name: payment-service
          image: "123456789012.dkr.ecr.us-east-1.amazonaws.com/payment-service:v1.2.3"
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
            - name: metrics
              containerPort: 9090
              protocol: TCP
          env:
            - name: NODE_ENV
              value: "production"
            - name: PORT
              value: "8080"
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: payment-service-secrets
                  key: db_host
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: payment-service-secrets
                  key: db_password
            - name: REDIS_URL
              valueFrom:
                configMapKeyRef:
                  name: payment-service-config
                  key: redis_url
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: payment-service
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - payment-service
                topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: production
  labels:
    app: payment-service
spec:
  type: ClusterIP
  selector:
    app: payment-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-service
  namespace: production
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/abcd1234
    alb.ingress.kubernetes.io/healthcheck-path: /health/ready
spec:
  ingressClassName: alb
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /payments
            pathType: Prefix
            backend:
              service:
                name: payment-service
                port:
                  number: 80
Helm Charts: Structure, values.yaml, Templating
Helm is the package manager for Kubernetes. Charts encapsulate Kubernetes manifests with templating for configuration:
Chart Structure
payment-service-chart/
├── Chart.yaml                  # Chart metadata (name, version, dependencies)
├── values.yaml                 # Default configuration values
├── values-production.yaml      # Environment-specific overrides
├── .helmignore                 # Files to exclude from packaging
├── charts/                     # Sub-charts (dependencies)
├── templates/
│   ├── _helpers.tpl            # Named template definitions
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── hpa.yaml                # Horizontal Pod Autoscaler
│   ├── serviceaccount.yaml
│   ├── secret.yaml
│   ├── configmap.yaml
│   ├── pdb.yaml                # Pod Disruption Budget
│   └── NOTES.txt               # Post-install instructions
└── tests/
    └── test-connection.yaml    # Helm test pod
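The tree lists values-production.yaml as the place for environment-specific overrides, but the guide does not show its contents. The following is a hedged sketch of what such a file might hold; the key names follow the chart's default values.yaml, while every number is an assumption chosen for illustration. The certificate ARN is the placeholder used in the Ingress manifest earlier.

```yaml
# payment-service-chart/values-production.yaml
# Illustrative overrides only; all values below are assumptions.
image:
  tag: "1.2.3"            # pin an explicit tag rather than falling back to appVersion

autoscaling:
  minReplicas: 3
  maxReplicas: 20

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "2"
    memory: 1Gi

ingress:
  annotations:
    # Placeholder ARN; substitute the certificate for the production hostname
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/abcd1234
```

Passing both files, for example `helm upgrade --install payment-service ./payment-service-chart -f values.yaml -f values-production.yaml`, lets the later file win. Helm merges maps key by key, so only the keys listed here change. Lists are replaced whole, so an override of a list such as ingress.hosts must repeat every entry.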
Complete Helm Chart Example
# payment-service-chart/Chart.yaml
apiVersion: v2
name: payment-service
description: Payment service Helm chart
type: application
version: 1.2.3
appVersion: "1.2.3"
kubeVersion: ">=1.27.0"
keywords:
  - payment
  - backend
maintainers:
  - name: Platform Team
    email: platform@example.com
dependencies:
  - name: redis
    version: "~> 18.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
# payment-service-chart/values.yaml
# ── Global ───────────────────────────────────────────────
nameOverride: ""
fullnameOverride: ""
# ── Image ────────────────────────────────────────────────
image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/payment-service
  pullPolicy: Always
  tag: ""  # Defaults to Chart appVersion
imagePullSecrets:
  - name: ecr-registry-secret
# ── Replicas ─────────────────────────────────────────────
replicaCount: 2  # Ignored while autoscaling.enabled is true
# ── Deployment Strategy ──────────────────────────────────
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 0
# ── Service Account ──────────────────────────────────────
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payment-service-role
  name: ""
# ── Pod Security Context ─────────────────────────────────
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
  seccompProfile:
    type: RuntimeDefault
# ── Service ──────────────────────────────────────────────
service:
  type: ClusterIP
  port: 80
  targetPort: 8080
# ── Ingress ──────────────────────────────────────────────
ingress:
  enabled: true
  className: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: ""
    alb.ingress.kubernetes.io/healthcheck-path: /health/ready
  hosts:
    - host: api.example.com
      paths:
        - path: /payments
          pathType: Prefix
  tls: []
# ── Resources ────────────────────────────────────────────
resources:
  limits:
    cpu: 1000m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 256Mi
# ── Autoscaling ──────────────────────────────────────────
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
# ── Probes ───────────────────────────────────────────────
livenessProbe:
  httpGet:
    path: /health/live
    port: http
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
# ── Pod Disruption Budget ────────────────────────────────
podDisruptionBudget:
  enabled: true
  minAvailable: 1
# ── Topology Spread ──────────────────────────────────────
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    # Without a labelSelector the constraint counts no pods and has no effect
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: payment-service
# ── Node Selector, Tolerations, Affinity ─────────────────
nodeSelector: {}
tolerations: []
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: payment-service
          topologyKey: kubernetes.io/hostname
# ── ConfigMap ────────────────────────────────────────────
configMap:
  data:
    NODE_ENV: "production"
    PORT: "8080"
    LOG_LEVEL: "info"
    REDIS_URL: "redis://payment-service-redis-master:6379"
# ── Secrets ──────────────────────────────────────────────
secrets:
  db_host: "payment-db.cluster-xxx.us-east-1.rds.amazonaws.com"
  # db_password is injected via external-secrets operator
# ── Redis Dependency ─────────────────────────────────────
redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: false
# templates/deployment.yaml
{{- $fullName := include "payment-service.fullname" . -}}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ $fullName }}
  labels:
    {{- include "payment-service.labels" . | nindent 4 }}
    app.kubernetes.io/version: {{ .Values.image.tag | default .Chart.AppVersion | quote }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  strategy:
    {{- toYaml .Values.strategy | nindent 4 }}
  selector:
    matchLabels:
      {{- include "payment-service.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "payment-service.selectorLabels" . | nindent 8 }}
        app.kubernetes.io/version: {{ .Values.image.tag | default .Chart.AppVersion | quote }}
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        checksum/secrets: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        rollme: {{ randAlphaNum 5 | quote }}  # Force rollout on every deploy
    spec:
      serviceAccountName: {{ include "payment-service.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.targetPort }}
              protocol: TCP
          envFrom:
            - configMapRef:
                name: {{ $fullName }}-config
            - secretRef:
                name: {{ $fullName }}-secrets
          livenessProbe:
            {{- toYaml .Values.livenessProbe | nindent 12 }}
          readinessProbe:
            {{- toYaml .Values.readinessProbe | nindent 12 }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      {{- with .Values.topologySpreadConstraints }}
      topologySpreadConstraints:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      volumes:
        - name: tmp
          emptyDir: {}
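The chart structure lists templates/pdb.yaml and values.yaml defines podDisruptionBudget.enabled and podDisruptionBudget.minAvailable, but the template itself is not shown. The following is a minimal sketch of how it might look, assuming the same named helpers that deployment.yaml already relies on (payment-service.fullname, payment-service.labels, payment-service.selectorLabels). It is one plausible shape, not necessarily the chart's actual file.

```yaml
{{- if .Values.podDisruptionBudget.enabled }}
# templates/pdb.yaml (sketch; assumes the helpers referenced in deployment.yaml)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "payment-service.fullname" . }}
  labels:
    {{- include "payment-service.labels" . | nindent 4 }}
spec:
  # Voluntary evictions (for example node drains) are refused below this count
  minAvailable: {{ .Values.podDisruptionBudget.minAvailable }}
  selector:
    matchLabels:
      {{- include "payment-service.selectorLabels" . | nindent 6 }}
{{- end }}
```

Reusing the selectorLabels helper keeps the budget's selector identical to the Deployment's, so the budget always covers exactly the pods the Deployment manages. With minAvailable: 1 and the autoscaler's minimum of 2 replicas, a node drain can still evict one pod at a time.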
EKS/AKS/GKE Specific Tips
| Aspect | Amazon EKS | Azure AKS | GCP GKE |
| Network (CNI) | VPC CNI (aws-node) or Calico | Azure CNI or kubenet | VPC-native (alias IPs), optionally with Dataplane V2 |
| Load Balancer | ALB (AWS LB Controller) or NLB | Azure Load Balancer + Application Gateway | GCP Load Balancer (Ingress) |
| Managed Node Groups | MNG, Fargate, Karpenter | VMSS node pools, Virtual Nodes (ACI) | Node pools, Autopilot |
| IAM for Pods | IAM Roles for Service Accounts (IRSA) | Azure AD Workload Identity | Workload Identity |
| Storage | EBS CSI, EFS CSI | Azure Disk CSI, Azure Files CSI | GCE PD CSI |
| Secrets | Secrets Manager + External Secrets | Azure Key Vault + CSI driver | Secret Manager + CSI driver |
| Cluster Autoscaling | Karpenter (recommended) or Cluster Autoscaler | Cluster Autoscaler or Virtual Node | Node Auto-provisioning |
| Observability | CloudWatch Container Insights, AMP | Azure Monitor Container Insights | Cloud Monitoring (Stackdriver) |
| Add-ons | EKS Add-ons (VPC CNI, CoreDNS, kube-proxy) | AKS Add-ons (AGIC, Azure Policy) | GKE Add-ons (Config Connector, Backup) |
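For the EKS column, the "IAM for Pods" row is the one that most directly touches the manifests in this guide. As a sketch of IRSA, the ServiceAccount below carries the role annotation that the chart's values.yaml sets under serviceAccount.annotations. The account ID and role name are the placeholders used elsewhere in this guide, and the IAM role is assumed to already trust the cluster's OIDC provider.

```yaml
# Sketch: ServiceAccount wired to an IAM role through IRSA.
# Assumes the role's trust policy already allows this namespace/name pair.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-service-sa          # referenced by serviceAccountName in the Deployment
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payment-service-role
```

Pods that run under this ServiceAccount receive a projected web identity token, which the AWS SDKs exchange for temporary credentials for that role. No static access keys need to be stored in a Secret.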
RBAC in Kubernetes
# RBAC: Grant a service account read-only access to pods in a namespace
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-reader-sa
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: pod-reader-sa
    namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
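The Role above is namespaced, so it cannot grant access to cluster-scoped resources such as nodes. The sketch below shows the cluster-wide counterpart from the Key Resources table, a ClusterRole with a ClusterRoleBinding, granted to the same pod-reader-sa account. The names node-reader and node-reader-binding are hypothetical.

```yaml
# Sketch: cluster-scoped read-only access to nodes (names are assumptions)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-reader-binding
subjects:
  - kind: ServiceAccount
    name: pod-reader-sa
    namespace: production
roleRef:
  kind: ClusterRole
  name: node-reader
  apiGroup: rbac.authorization.k8s.io
```

A quick way to check the result is `kubectl auth can-i list nodes --as=system:serviceaccount:production:pod-reader-sa`, which should answer yes once both objects are applied.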
Resource Limits and HPA
# HPA manifest
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
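The Utilization targets in the HPA are percentages of each container's resource requests, so a container without requests gives the autoscaler nothing to compute against. One way to cover workloads that omit them is a namespace-level LimitRange. The sketch below uses a hypothetical object name and reuses the request and limit figures from the payment-service Deployment as defaults.

```yaml
# Sketch: default requests/limits for containers that declare none.
# The object name is an assumption; the figures mirror the Deployment example.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        cpu: 250m
        memory: 256Mi
      default:               # applied when a container sets no limits
        cpu: "1"
        memory: 512Mi
```

Defaults are injected at admission time, so they only affect pods created after the LimitRange exists. Containers that set their own requests and limits are left unchanged.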
Debugging Commands
# ── Pod Debugging ────────────────────────────────────────
# Check pod status and events
kubectl describe pod <pod-name> -n <namespace>
# View logs (current container)
kubectl logs <pod-name> -n <namespace>
# View logs (previous container after crash)
kubectl logs <pod-name> -n <namespace> --previous
# Follow logs with timestamps
kubectl logs -f <pod-name> -n <namespace> --timestamps
# Multi-container pod: specify container
kubectl logs <pod-name> -c <container-name> -n <namespace>
# Stream all pods matching label
kubectl logs -f -l app=payment-service -n <namespace> --all-containers
# ── Interactive Debugging ────────────────────────────────
# Shell into running pod
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Run a debug container alongside (ephemeral container)
kubectl debug -it <pod-name> -n <namespace> --image=nicolaka/netshoot --target=<container>
# Copy files to/from pod
kubectl cp ./local-file <pod-name>:/tmp/remote-file -n <namespace>
kubectl cp <pod-name>:/tmp/logs ./local-logs -n <namespace>
# ── Node Debugging ───────────────────────────────────────
# Check node resource usage
kubectl top node
# Check node status and conditions
kubectl describe node <node-name>
# Check node events
kubectl get events --field-selector involvedObject.kind=Node
# Cordon/uncordon a node (prevent new scheduling)
kubectl cordon <node-name>
kubectl uncordon <node-name>
# Drain a node (evict pods gracefully)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# ── Network Debugging ────────────────────────────────────
# Test connectivity between pods
kubectl run debug --rm -i --restart=Never --image=busybox -- wget -qO- http://service-name.namespace.svc.cluster.local
# Check DNS resolution
kubectl run debug --rm -i --restart=Never --image=busybox -- nslookup kubernetes.default
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>
# ── Event Analysis ───────────────────────────────────────
# Watch all events in real-time
kubectl get events -w --sort-by=.lastTimestamp
# Filter events by type
kubectl get events --field-selector type=Warning
# Filter events by object
kubectl get events --field-selector involvedObject.name=<pod-name>