Kubernetes Operations

Kubernetes is the de facto container orchestration platform. This guide covers cluster operations, deployment patterns, and Helm chart management based on production experience running EKS clusters serving millions of users across Samsung services.

Kubernetes Architecture

┌───────────────────────────────────────────────────────────────────────────┐
│                               Control Plane                               │
│  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐  │
│  │  API Server   │ │     etcd      │ │   Scheduler   │ │  Controller   │  │
│  │ (kube-        │ │ (data store)  │ │ (kube-        │ │  Manager      │  │
│  │  apiserver)   │ │               │ │  scheduler)   │ │ (kube-cm)     │  │
│  │ kubectl →     │ │ All cluster   │ │ Assigns pods  │ │ Manages       │  │
│  │ REST API      │ │ state         │ │ to nodes      │ │ replicas,     │  │
│  │               │ │               │ │               │ │ services      │  │
│  └───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘  │
└─────────────────────────────────────┬─────────────────────────────────────┘
                                      │ (via API Server)
           ┌──────────────────────────┼──────────────────────────┐
           ▼                          ▼                          ▼
┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│    Worker Node 1    │    │    Worker Node 2    │    │    Worker Node 3    │
│  ┌───────────────┐  │    │  ┌───────────────┐  │    │  ┌───────────────┐  │
│  │ kubelet       │  │    │  │ kubelet       │  │    │  │ kubelet       │  │
│  │ (agent)       │  │    │  │ (agent)       │  │    │  │ (agent)       │  │
│  ├───────────────┤  │    │  ├───────────────┤  │    │  ├───────────────┤  │
│  │ kube-proxy    │  │    │  │ kube-proxy    │  │    │  │ kube-proxy    │  │
│  │ (network)     │  │    │  │ (network)     │  │    │  │ (network)     │  │
│  ├───────────────┤  │    │  ├───────────────┤  │    │  ├───────────────┤  │
│  │ Container     │  │    │  │ Container     │  │    │  │ Container     │  │
│  │ Runtime       │  │    │  │ Runtime       │  │    │  │ Runtime       │  │
│  │ (containerd)  │  │    │  │ (containerd)  │  │    │  │ (containerd)  │  │
│  ├───────────────┤  │    │  ├───────────────┤  │    │  ├───────────────┤  │
│  │ [Pod] [Pod]   │  │    │  │ [Pod] [Pod]   │  │    │  │ [Pod] [Pod]   │  │
│  │ [ c c ]       │  │    │  │ [ c c ]       │  │    │  │ [ c c c ]     │  │
│  └───────────────┘  │    │  └───────────────┘  │    │  └───────────────┘  │
└─────────────────────┘    └─────────────────────┘    └─────────────────────┘

Control Plane Components

Component	Function	Failure Impact
API Server (kube-apiserver)	Exposes Kubernetes API; front end for all cluster operations	Cluster unmanageable; existing workloads unaffected
etcd	Distributed key-value store for all cluster state	Control plane unavailable (running pods continue); cluster state lost if not backed up
Scheduler (kube-scheduler)	Assigns pods to nodes based on resources and constraints	New pods not scheduled; existing pods run
Controller Manager (kube-controller-manager)	Runs controllers (replication, endpoints, service account)	Self-healing stops; auto-scaling fails
Cloud Controller Manager	Integrates with cloud provider (AWS, Azure, GCP)	Load balancers, volumes not provisioned

Worker Node Components

Component	Function
kubelet	Agent that ensures containers run as specified in PodSpec
kube-proxy	Maintains network rules and connection forwarding
Container Runtime	Executes containers (containerd, CRI-O)

Key Resources

Resource	Purpose	Key Fields
Pod	Smallest deployable unit; contains one or more containers	containers, volumes, restartPolicy
Deployment	Manages Pod replicas; supports rolling updates	replicas, strategy, selector, template
StatefulSet	Manages stateful apps with stable network identity and storage	serviceName, volumeClaimTemplates, podManagementPolicy
DaemonSet	Runs one pod on every node, or on every node matching a selector (logging, monitoring agents)	nodeSelector, tolerations
Service	Exposes pods via stable IP/DNS (ClusterIP, NodePort, LoadBalancer)	selector, ports, type
Ingress	HTTP/HTTPS routing rules to services	ingressClassName, rules, tls
ConfigMap	Non-sensitive configuration data	data, binaryData
Secret	Sensitive data (passwords, tokens, keys); base64-encoded, not encrypted by default	type (Opaque, kubernetes.io/tls, kubernetes.io/dockerconfigjson), data (base64), stringData
PersistentVolume	Storage resource in the cluster	capacity, accessModes, storageClassName
PersistentVolumeClaim	Request for storage by a pod	resources.requests.storage, storageClassName
ServiceAccount	Identity for pods accessing the API	automountServiceAccountToken
Role / ClusterRole	Permission definitions (RBAC)	rules (apiGroups, resources, verbs)
RoleBinding / ClusterRoleBinding	Assigns roles to subjects	subjects, roleRef
NetworkPolicy	Pod-level firewall rules	podSelector, policyTypes, ingress, egress
HorizontalPodAutoscaler	Auto-scales pods based on CPU/memory/custom metrics	scaleTargetRef, minReplicas, maxReplicas, metrics
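
To illustrate the NetworkPolicy row, here is a minimal sketch of a default-deny ingress policy combined with one allow rule. The `production` namespace, the `app: payment-service` label, and port 8080 are taken from the example manifests in this guide; the `app: api-gateway` client label is a hypothetical name. Policies only take effect if the cluster's network plugin enforces NetworkPolicy.

```yaml
# Sketch only: deny all ingress in the namespace, then allow one client.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}            # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-payment   # hypothetical name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway     # hypothetical client label
      ports:
        - protocol: TCP
          port: 8080
```

Traffic that reaches pods directly from a load balancer (for example an ALB with `target-type: ip`) comes from VPC addresses rather than from pods, so it would likely need an additional `ipBlock` rule.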

kubectl Command Reference

Command	Description
`kubectl get pods -n <ns>`	List pods in a namespace
`kubectl get pods -o wide`	List pods with node and IP details
`kubectl get all -n <ns>`	List common workload resources in a namespace (excludes ConfigMaps, Secrets, Ingresses, and custom resources)
`kubectl describe pod <name>`	Detailed pod information and events
`kubectl logs <pod> -c <container>`	View container logs
`kubectl logs <pod> --tail=100 -f`	Show the last 100 log lines, then follow new output
`kubectl logs <pod> --previous`	Logs from previous container instance (crash debugging)
`kubectl exec -it <pod> -- /bin/sh`	Interactive shell in running pod
`kubectl apply -f manifest.yaml`	Apply configuration from file
`kubectl apply -f directory/`	Apply all manifests in directory
`kubectl delete -f manifest.yaml`	Delete resources from manifest
`kubectl port-forward pod/<name> 8080:80`	Forward local port to pod
`kubectl top pod -n <ns>`	Show resource usage (requires metrics-server)
`kubectl get events --sort-by=.lastTimestamp`	View cluster events sorted by time
`kubectl rollout status deployment/<name>`	Check rollout progress
`kubectl rollout history deployment/<name>`	View rollout history
`kubectl rollout undo deployment/<name>`	Rollback to previous revision
`kubectl get nodes -o custom-columns=...`	Custom output formatting
`kubectl config use-context <context>`	Switch kubeconfig context
`kubectl get ns`	List namespaces
`kubectl create ns <name>`	Create namespace
`kubectl get svc -n <ns>`	List services
`kubectl get ingress -n <ns>`	List ingresses
`kubectl get pv,pvc -n <ns>`	List persistent volumes and claims
`kubectl auth can-i <verb> <resource>`	Check RBAC permissions
`kubectl explain pod.spec`	Get API documentation for resource field

Complete Deployment + Service YAML Example

# manifests/web-app.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
  labels:
    app: payment-service
    version: v1.2.3
    tier: backend
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
        version: v1.2.3
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: payment-service-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
        - name: payment-service
          image: "123456789012.dkr.ecr.us-east-1.amazonaws.com/payment-service:v1.2.3"
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
            - name: metrics
              containerPort: 9090
              protocol: TCP
          env:
            - name: NODE_ENV
              value: "production"
            - name: PORT
              value: "8080"
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: payment-service-secrets
                  key: db_host
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: payment-service-secrets
                  key: db_password
            - name: REDIS_URL
              valueFrom:
                configMapKeyRef:
                  name: payment-service-config
                  key: redis_url
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: payment-service
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - payment-service
                topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: production
  labels:
    app: payment-service
spec:
  type: ClusterIP
  selector:
    app: payment-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-service
  namespace: production
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/abcd1234
    alb.ingress.kubernetes.io/healthcheck-path: /health/ready
spec:
  ingressClassName: alb
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /payments
            pathType: Prefix
            backend:
              service:
                name: payment-service
                port:
                  number: 80
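
The Deployment above references a ServiceAccount and a ConfigMap that are not shown. The following is a minimal sketch of those supporting objects plus a PodDisruptionBudget. The role ARN, Redis URL, and `minAvailable: 1` are borrowed from the Helm values later in this guide and are assumptions for this standalone manifest. The `payment-service-secrets` Secret is omitted because `db_password` is described as injected by the external-secrets operator.

```yaml
# manifests/web-app-support.yaml (hypothetical file name)
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-service-sa
  namespace: production
  annotations:
    # IRSA role, assumed to match the Helm values
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payment-service-role
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-service-config
  namespace: production
data:
  # key name must match configMapKeyRef.key in the Deployment
  redis_url: "redis://payment-service-redis-master:6379"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service
  namespace: production
spec:
  minAvailable: 1            # keep at least one pod during node drains
  selector:
    matchLabels:
      app: payment-service
```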

Helm Charts: Structure, values.yaml, Templating

Helm is the package manager for Kubernetes. A chart packages Kubernetes manifests as templates, with configuration supplied through values files.

Chart Structure

payment-service-chart/
├── Chart.yaml           # Chart metadata (name, version, dependencies)
├── values.yaml          # Default configuration values
├── values-production.yaml  # Environment-specific overrides
├── .helmignore          # Files to exclude from packaging
├── charts/              # Sub-charts (dependencies)
└── templates/
    ├── _helpers.tpl     # Named template definitions
    ├── deployment.yaml
    ├── service.yaml
    ├── ingress.yaml
    ├── hpa.yaml         # Horizontal Pod Autoscaler
    ├── serviceaccount.yaml
    ├── secret.yaml
    ├── configmap.yaml
    ├── pdb.yaml         # Pod Disruption Budget
    ├── NOTES.txt        # Post-install instructions
    └── tests/
        └── test-connection.yaml  # Helm test pod (must live under templates/ to be rendered)

Complete Helm Chart Example

# payment-service-chart/Chart.yaml
apiVersion: v2
name: payment-service
description: Payment service Helm chart
type: application
version: 1.2.3
appVersion: "1.2.3"
kubeVersion: ">=1.27.0"
keywords:
  - payment
  - backend
maintainers:
  - name: Platform Team
    email: platform@example.com
dependencies:
  - name: redis
    version: "~> 18.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled

# payment-service-chart/values.yaml
# ── Global ───────────────────────────────────────────────
nameOverride: ""
fullnameOverride: ""

# ── Image ────────────────────────────────────────────────
image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/payment-service
  pullPolicy: Always
  tag: ""  # Defaults to Chart appVersion

imagePullSecrets:
  - name: ecr-registry-secret

# ── Replicas ─────────────────────────────────────────────
replicaCount: 2

# ── Deployment Strategy ──────────────────────────────────
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 0

# ── Service Account ──────────────────────────────────────
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payment-service-role
  name: ""

# ── Pod Security Context ─────────────────────────────────
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
  seccompProfile:
    type: RuntimeDefault

# ── Service ──────────────────────────────────────────────
service:
  type: ClusterIP
  port: 80
  targetPort: 8080

# ── Ingress ──────────────────────────────────────────────
ingress:
  enabled: true
  className: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: ""
    alb.ingress.kubernetes.io/healthcheck-path: /health/ready
  hosts:
    - host: api.example.com
      paths:
        - path: /payments
          pathType: Prefix
  tls: []

# ── Resources ────────────────────────────────────────────
resources:
  limits:
    cpu: 1000m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 256Mi

# ── Autoscaling ──────────────────────────────────────────
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

# ── Probes ───────────────────────────────────────────────
livenessProbe:
  httpGet:
    path: /health/live
    port: http
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

# ── Pod Disruption Budget ────────────────────────────────
podDisruptionBudget:
  enabled: true
  minAvailable: 1

# ── Topology Spread ──────────────────────────────────────
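topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:  # needed so the constraint counts this chart's pods
      matchLabels:
        app.kubernetes.io/name: payment-service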

# ── Node Selector, Tolerations, Affinity ─────────────────
nodeSelector: {}
tolerations: []
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: payment-service
          topologyKey: kubernetes.io/hostname

# ── ConfigMap ────────────────────────────────────────────
configMap:
  data:
    NODE_ENV: "production"
    PORT: "8080"
    LOG_LEVEL: "info"
    REDIS_URL: "redis://payment-service-redis-master:6379"

# ── Secrets ──────────────────────────────────────────────
secrets:
  db_host: "payment-db.cluster-xxx.us-east-1.rds.amazonaws.com"
  # db_password is injected via external-secrets operator

# ── Redis Dependency ─────────────────────────────────────
redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: false
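
The chart structure lists a `values-production.yaml` for environment-specific overrides. A minimal sketch of what such a file might contain follows; the specific numbers are illustrative assumptions, and the certificate ARN is the placeholder used in the Ingress manifest earlier in this guide.

```yaml
# payment-service-chart/values-production.yaml (hypothetical overrides)
replicaCount: 3

autoscaling:
  minReplicas: 3
  maxReplicas: 20

ingress:
  annotations:
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/abcd1234

configMap:
  data:
    LOG_LEVEL: "warn"
```

Helm merges maps key by key, so only the listed keys change and the other defaults in `values.yaml` are kept. Lists such as `ingress.hosts` are replaced wholesale. The file would typically be applied with `helm upgrade --install payment-service ./payment-service-chart -n production -f payment-service-chart/values-production.yaml`.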

# templates/deployment.yaml
{{- $fullName := include "payment-service.fullname" . -}}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ $fullName }}
  labels:
    {{- include "payment-service.labels" . | nindent 4 }}
    app.kubernetes.io/version: {{ .Values.image.tag | default .Chart.AppVersion | quote }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  strategy:
    {{- toYaml .Values.strategy | nindent 4 }}
  selector:
    matchLabels:
      {{- include "payment-service.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "payment-service.selectorLabels" . | nindent 8 }}
        app.kubernetes.io/version: {{ .Values.image.tag | default .Chart.AppVersion | quote }}
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        checksum/secrets: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        rollme: {{ randAlphaNum 5 | quote }}  # Forces a rollout on every upgrade; this makes the checksums above redundant, so keep only one approach
    spec:
      serviceAccountName: {{ include "payment-service.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.targetPort }}
              protocol: TCP
          envFrom:
            - configMapRef:
                name: {{ $fullName }}-config
            - secretRef:
                name: {{ $fullName }}-secrets
          livenessProbe:
            {{- toYaml .Values.livenessProbe | nindent 12 }}
          readinessProbe:
            {{- toYaml .Values.readinessProbe | nindent 12 }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      {{- with .Values.topologySpreadConstraints }}
      topologySpreadConstraints:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      volumes:
        - name: tmp
          emptyDir: {}
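
The Deployment template omits `replicas` when `autoscaling.enabled` is true, which implies a matching HPA template. The chart's actual `templates/hpa.yaml` is not shown in this guide, so the following is only a sketch of how it could consume the `autoscaling` values. It assumes the `payment-service.fullname` and `payment-service.labels` helpers used by the Deployment template.

```yaml
# templates/hpa.yaml (sketch, not the chart's actual file)
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "payment-service.fullname" . }}
  labels:
    {{- include "payment-service.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "payment-service.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    {{- with .Values.autoscaling.targetCPUUtilizationPercentage }}
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ . }}
    {{- end }}
    {{- with .Values.autoscaling.targetMemoryUtilizationPercentage }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ . }}
    {{- end }}
  {{- with .Values.autoscaling.behavior }}
  behavior:
    {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}
```

Rendering the chart with `helm template` is a quick way to check that the output is valid before installing.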

EKS/AKS/GKE Specific Tips

Aspect	Amazon EKS	Azure AKS	GCP GKE
Network (CNI)	VPC CNI (aws-node) or Calico	Azure CNI or kubenet	VPC-native (alias IPs), optionally GKE Dataplane V2
Load Balancer	ALB (AWS LB Controller) or NLB	Azure Load Balancer + Application Gateway	GCP Load Balancer (Ingress)
Compute Options	Managed Node Groups (MNG), Fargate, Karpenter	VMSS node pools, Virtual Nodes (ACI)	Node pools, Autopilot
IAM for Pods	IAM Roles for Service Accounts (IRSA) or EKS Pod Identity	Microsoft Entra Workload ID (formerly Azure AD Workload Identity)	Workload Identity Federation for GKE
Storage	EBS CSI, EFS CSI	Azure Disk CSI, Azure Files CSI	GCE PD CSI
Secrets	Secrets Manager + External Secrets	Azure Key Vault + CSI driver	Secret Manager + CSI driver
Cluster Autoscaling	Karpenter (recommended) or Cluster Autoscaler	Cluster Autoscaler or Virtual Node	Node Auto-provisioning
Observability	CloudWatch Container Insights, AMP	Azure Monitor Container Insights	Cloud Monitoring (Stackdriver)
Add-ons	EKS Add-ons (VPC CNI, CoreDNS, kube-proxy)	AKS Add-ons (AGIC, Azure Policy)	GKE Add-ons (Config Connector, Backup)
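
As an example of the Storage row on EKS, here is a sketch of a StorageClass backed by the EBS CSI driver and a claim that uses it. The object names and the 20Gi size are hypothetical. On AKS and GKE the provisioner would instead be `disk.csi.azure.com` or `pd.csi.storage.gke.io`, with provider-specific parameters.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted                 # hypothetical name
provisioner: ebs.csi.aws.com          # EBS CSI driver
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer   # create the volume in the pod's zone
reclaimPolicy: Delete
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: payment-data                  # hypothetical name
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-encrypted
  resources:
    requests:
      storage: 20Gi
```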

RBAC in Kubernetes

# RBAC: Grant a service account read-only access to pods in a namespace
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-reader-sa
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]  # the pods/log subresource only supports get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: pod-reader-sa
    namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
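
Roles are namespaced, so cluster-scoped resources such as nodes need a ClusterRole and a ClusterRoleBinding. The following sketch extends the same service account with read-only access to nodes; the `node-reader` names are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader                  # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-reader-binding          # hypothetical name
subjects:
  - kind: ServiceAccount
    name: pod-reader-sa
    namespace: production
roleRef:
  kind: ClusterRole
  name: node-reader
  apiGroup: rbac.authorization.k8s.io
```

Either binding can be checked by impersonating the service account, for example `kubectl auth can-i list pods -n production --as=system:serviceaccount:production:pod-reader-sa`.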

Resource Limits and HPA

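# HPA manifest
# CPU and memory metrics require metrics-server; the Pods metric requires a
# custom metrics adapter (for example Prometheus Adapter). When an HPA owns a
# Deployment, omit spec.replicas there so re-applying it does not reset scale.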
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
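
HPA utilization targets are measured against each container's resource requests, so with the 250m CPU request used in this guide a 70% target corresponds to roughly 175m of average CPU per pod. Pods without requests cannot be scaled on utilization. A namespace-level LimitRange can supply defaults, and a ResourceQuota can cap the total. The sketch below reuses the request and limit values from the Deployment example; the quota figures and object names are illustrative assumptions.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults           # hypothetical name
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:                # applied when a container sets no requests
        cpu: 250m
        memory: 256Mi
      default:                       # applied when a container sets no limits
        cpu: "1"
        memory: 512Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota             # hypothetical name
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
```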

Debugging Commands

# ── Pod Debugging ────────────────────────────────────────
# Check pod status and events
kubectl describe pod <pod-name> -n <namespace>

# View logs (current container)
kubectl logs <pod-name> -n <namespace>

# View logs (previous container after crash)
kubectl logs <pod-name> -n <namespace> --previous

# Follow logs with timestamps
kubectl logs -f <pod-name> -n <namespace> --timestamps

# Multi-container pod: specify container
kubectl logs <pod-name> -c <container-name> -n <namespace>

# Stream all pods matching label
kubectl logs -f -l app=payment-service -n <namespace> --all-containers

# ── Interactive Debugging ────────────────────────────────
# Shell into running pod
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

# Run a debug container alongside (ephemeral container)
kubectl debug -it <pod-name> -n <namespace> --image=nicolaka/netshoot --target=<container>

# Copy files to/from pod (requires the tar binary in the container image)
kubectl cp ./local-file <pod-name>:/tmp/remote-file -n <namespace>
kubectl cp <pod-name>:/tmp/logs ./local-logs -n <namespace>

# ── Node Debugging ───────────────────────────────────────
# Check node resource usage
kubectl top node

# Check node status and conditions
kubectl describe node <node-name>

# Check node events
kubectl get events --field-selector involvedObject.kind=Node

# Cordon/uncordon a node (prevent new scheduling)
kubectl cordon <node-name>
kubectl uncordon <node-name>

# Drain a node (evict pods gracefully)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# ── Network Debugging ────────────────────────────────────
# Test connectivity between pods
kubectl run debug --rm -i --restart=Never --image=busybox -- wget -qO- http://service-name.namespace.svc.cluster.local

# Check DNS resolution (busybox pinned to 1.28; nslookup in some newer busybox images is unreliable)
kubectl run debug --rm -i --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default

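# Check service endpoints (EndpointSlices supersede the legacy Endpoints API)
kubectl get endpoints <service-name> -n <namespace>
kubectl get endpointslices -l kubernetes.io/service-name=<service-name> -n <namespace>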

# ── Event Analysis ───────────────────────────────────────
# Watch all events in real time (--sort-by is ignored when combined with --watch)
kubectl get events -w

# Filter events by type
kubectl get events --field-selector type=Warning

# Filter events by object
kubectl get events --field-selector involvedObject.name=<pod-name>