Kubernetes Operations

Kubernetes is the de facto standard for container orchestration. This guide covers cluster operations, deployment patterns, and Helm chart management, based on production experience running EKS clusters that serve millions of users across Samsung services.

Kubernetes Architecture

┌──────────────────────────────────────────────────────────────────┐
│                          Control Plane                           │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│  │ API Server  │ │    etcd     │ │  Scheduler  │ │ Controller  │ │
│  │(kube-apisrv)│ │(data store) │ │   (kube-    │ │  Manager    │ │
│  │             │ │             │ │ scheduler)  │ │ (kube-cm)   │ │
│  │ kubectl →   │ │ All cluster │ │ Assigns     │ │ Manages     │ │
│  │ REST API    │ │ state       │ │ pods to     │ │ replicas,   │ │
│  │             │ │             │ │ nodes       │ │ services    │ │
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└────────────────────────────────┬─────────────────────────────────┘
                                 │ (via API Server)
           ┌─────────────────────┼─────────────────────┐
           ▼                     ▼                     ▼
 ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
 │   Worker Node 1   │ │   Worker Node 2   │ │   Worker Node 3   │
 │ ┌───────────────┐ │ │ ┌───────────────┐ │ │ ┌───────────────┐ │
 │ │ kubelet       │ │ │ │ kubelet       │ │ │ │ kubelet       │ │
 │ │ (agent)       │ │ │ │ (agent)       │ │ │ │ (agent)       │ │
 │ ├───────────────┤ │ │ ├───────────────┤ │ │ ├───────────────┤ │
 │ │ kube-proxy    │ │ │ │ kube-proxy    │ │ │ │ kube-proxy    │ │
 │ │ (network)     │ │ │ │ (network)     │ │ │ │ (network)     │ │
 │ ├───────────────┤ │ │ ├───────────────┤ │ │ ├───────────────┤ │
 │ │ Container     │ │ │ │ Container     │ │ │ │ Container     │ │
 │ │ Runtime       │ │ │ │ Runtime       │ │ │ │ Runtime       │ │
 │ │ (containerd)  │ │ │ │ (containerd)  │ │ │ │ (containerd)  │ │
 │ ├───────────────┤ │ │ ├───────────────┤ │ │ ├───────────────┤ │
 │ │ [Pod] [Pod]   │ │ │ │ [Pod] [Pod]   │ │ │ │ [Pod] [Pod]   │ │
 │ │ [ c c ]       │ │ │ │ [c c]         │ │ │ │ [ c c c ]     │ │
 │ └───────────────┘ │ │ └───────────────┘ │ │ └───────────────┘ │
 └───────────────────┘ └───────────────────┘ └───────────────────┘

Control Plane Components

| Component | Function | Failure Impact |
|---|---|---|
| API Server (kube-apiserver) | Exposes the Kubernetes API; front end for all cluster operations | Cluster unmanageable; existing workloads keep running |
| etcd | Distributed key-value store for all cluster state | Control plane cannot read or persist state; permanent data loss if etcd is not backed up |
| Scheduler (kube-scheduler) | Assigns pods to nodes based on resources and constraints | New pods stay Pending; existing pods keep running |
| Controller Manager (kube-controller-manager) | Runs controllers (replication, endpoints, service account) | Self-healing stops (failed pods are not replaced); autoscaling stops |
| Cloud Controller Manager | Integrates with the cloud provider (AWS, Azure, GCP) | Cloud load balancers not provisioned or updated; new nodes not initialized |

Worker Node Components

| Component | Function |
|---|---|
| kubelet | Agent that ensures containers run as specified in the PodSpec |
| kube-proxy | Maintains network rules and connection forwarding |
| Container Runtime | Executes containers (containerd, CRI-O) |

Key Resources

| Resource | Purpose | Key Fields |
|---|---|---|
| Pod | Smallest deployable unit; contains one or more containers | containers, volumes, restartPolicy |
| Deployment | Manages Pod replicas (through ReplicaSets); supports rolling updates | replicas, strategy, selector, template |
| StatefulSet | Manages stateful apps with stable network identity and storage | serviceName, volumeClaimTemplates, podManagementPolicy |
| DaemonSet | Runs one pod on each (matching) node, e.g., logging and monitoring agents | nodeSelector, tolerations |
| Service | Exposes pods via a stable IP/DNS name (ClusterIP, NodePort, LoadBalancer) | selector, ports, type |
| Ingress | HTTP/HTTPS routing rules to Services | ingressClassName, rules, tls, annotations |
| ConfigMap | Non-sensitive configuration data | data, binaryData |
| Secret | Sensitive data (passwords, tokens, keys) | type (Opaque, kubernetes.io/tls, kubernetes.io/dockerconfigjson), data (base64-encoded) |
| PersistentVolume | Storage resource in the cluster | capacity, accessModes, storageClassName |
| PersistentVolumeClaim | Request for storage by a pod | resources.requests.storage, accessModes, storageClassName |
| ServiceAccount | Identity for pods accessing the API | automountServiceAccountToken |
| Role / ClusterRole | Permission definitions (RBAC); a Role is namespaced, a ClusterRole is cluster-wide | rules (apiGroups, resources, verbs) |
| RoleBinding / ClusterRoleBinding | Assigns roles to subjects | subjects, roleRef |
| NetworkPolicy | Pod-level firewall rules | podSelector, policyTypes, ingress, egress |
| HorizontalPodAutoscaler | Auto-scales pods based on CPU/memory/custom metrics | scaleTargetRef, minReplicas, maxReplicas, metrics |
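
NetworkPolicy is the only resource in this table without an example on this page. The sketch below allows ingress to the payment-service pods on the HTTP port only from a hypothetical checkout-service client, and on the metrics port only from a namespace named monitoring; both of those names are assumptions. Policies take effect only when the cluster's CNI enforces them (for example Calico, or the VPC CNI with network policy support enabled).

```yaml
# Hypothetical example: restrict which pods may reach payment-service
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-service-ingress
  namespace: production
spec:
  podSelector:                  # The pods this policy protects
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress                   # Once selected, any ingress not allowed below is denied
  ingress:
    - from:                     # HTTP from the (hypothetical) checkout-service pods
        - podSelector:
            matchLabels:
              app: checkout-service
      ports:
        - protocol: TCP
          port: 8080            # Container port, not the Service port (80)
    - from:                     # Metrics scraping from the monitoring namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 9090
```

NetworkPolicy ports match the pod's container ports, which is why 8080 appears rather than the Service port 80. With an ALB Ingress in target-type ip mode, load balancer traffic arrives from VPC addresses rather than from pods, so it would need an additional ipBlock rule.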

kubectl Command Reference

| Command | Description |
|---|---|
| kubectl get pods -n <ns> | List pods in a namespace |
| kubectl get pods -o wide | List pods with node and IP details |
| kubectl get all -n <ns> | List common resource types (pods, services, deployments, replicasets, etc.) in a namespace; ConfigMaps, Secrets, and Ingresses are not included |
| kubectl describe pod <name> | Detailed pod information and events |
| kubectl logs <pod> -c <container> | View container logs |
| kubectl logs <pod> --tail=100 -f | Follow the last 100 log lines |
| kubectl logs <pod> --previous | Logs from the previous container instance (crash debugging) |
| kubectl exec -it <pod> -- /bin/sh | Interactive shell in a running pod |
| kubectl apply -f manifest.yaml | Apply configuration from a file |
| kubectl apply -f directory/ | Apply all manifests in a directory |
| kubectl delete -f manifest.yaml | Delete resources defined in a manifest |
| kubectl port-forward pod/<name> 8080:80 | Forward a local port to a pod |
| kubectl top pod -n <ns> | Show resource usage (requires metrics-server) |
| kubectl get events --sort-by=.lastTimestamp | View cluster events sorted by time |
| kubectl rollout status deployment/<name> | Check rollout progress |
| kubectl rollout history deployment/<name> | View rollout history |
| kubectl rollout undo deployment/<name> | Roll back to the previous revision |
| kubectl get nodes -o custom-columns=... | Custom output formatting |
| kubectl config use-context <context> | Switch kubeconfig context |
| kubectl get ns | List namespaces |
| kubectl create ns <name> | Create a namespace |
| kubectl get svc -n <ns> | List services |
| kubectl get ingress -n <ns> | List ingresses |
| kubectl get pv,pvc -n <ns> | List persistent volumes and claims |
| kubectl auth can-i <verb> <resource> | Check RBAC permissions |
| kubectl explain pod.spec | Get API documentation for a resource field |
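
kubectl config use-context switches between contexts defined in a kubeconfig file (by default ~/.kube/config). On EKS this file is usually generated by aws eks update-kubeconfig. The sketch below shows the rough shape of such an entry; the cluster, user, and context names are hypothetical, and the endpoint and CA data are placeholders.

```yaml
# ~/.kube/config (sketch; names, endpoint, and CA data are placeholders)
apiVersion: v1
kind: Config
clusters:
  - name: prod-cluster
    cluster:
      server: https://EXAMPLE.gr7.us-east-1.eks.amazonaws.com
      certificate-authority-data: <base64-encoded-ca>
users:
  - name: prod-user
    user:
      exec:                     # kubectl runs this command to obtain a short-lived token
        apiVersion: client.authentication.k8s.io/v1beta1
        command: aws
        args: ["eks", "get-token", "--cluster-name", "prod-cluster", "--region", "us-east-1"]
contexts:
  - name: prod                  # Selected with: kubectl config use-context prod
    context:
      cluster: prod-cluster
      user: prod-user
      namespace: production     # Default namespace when -n is omitted
current-context: prod
```

Because the context sets a namespace, commands such as kubectl get pods target production without -n; kubectl config set-context --current --namespace=<ns> changes it in place.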

Complete Deployment, Service, and Ingress YAML Example

# manifests/web-app.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
  labels:
    app: payment-service
    version: v1.2.3
    tier: backend
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
        version: v1.2.3
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: payment-service-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
        - name: payment-service
          image: "123456789012.dkr.ecr.us-east-1.amazonaws.com/payment-service:v1.2.3"
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
            - name: metrics
              containerPort: 9090
              protocol: TCP
          env:
            - name: NODE_ENV
              value: "production"
            - name: PORT
              value: "8080"
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: payment-service-secrets
                  key: db_host
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: payment-service-secrets
                  key: db_password
            - name: REDIS_URL
              valueFrom:
                configMapKeyRef:
                  name: payment-service-config
                  key: redis_url
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: payment-service
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - payment-service
                topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: production
  labels:
    app: payment-service
spec:
  type: ClusterIP
  selector:
    app: payment-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-service
  namespace: production
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/abcd1234
    alb.ingress.kubernetes.io/healthcheck-path: /health/ready
spec:
  ingressClassName: alb
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /payments
            pathType: Prefix
            backend:
              service:
                name: payment-service
                port:
                  number: 80
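
The Deployment above reads DB_HOST and DB_PASSWORD from a Secret named payment-service-secrets and REDIS_URL from a ConfigMap named payment-service-config; until both exist, its pods fail with CreateContainerConfigError. A minimal sketch of those two objects follows. Every value is a placeholder, and in production the Secret would more likely be synced from an external store than committed as YAML.

```yaml
# manifests/web-app-config.yaml (placeholder values)
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-service-config
  namespace: production
data:
  redis_url: "redis://redis.production.svc.cluster.local:6379"   # Hypothetical Redis Service
---
apiVersion: v1
kind: Secret
metadata:
  name: payment-service-secrets
  namespace: production
type: Opaque
stringData:                     # Write-only plain text; merged into data, where values are base64-encoded
  db_host: "payment-db.example.internal"
  db_password: "change-me"
```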

Helm Charts: Structure, values.yaml, Templating

Helm is the package manager for Kubernetes. A chart bundles Kubernetes manifests written as Go templates, which Helm renders with configuration values from values.yaml and any per-environment overrides:

Chart Structure

payment-service-chart/
├── Chart.yaml              # Chart metadata (name, version, dependencies)
├── values.yaml             # Default configuration values
├── values-production.yaml  # Environment-specific overrides
├── .helmignore             # Files to exclude from packaging
├── charts/                 # Sub-charts (dependencies)
└── templates/
    ├── _helpers.tpl        # Named template definitions
    ├── deployment.yaml
    ├── service.yaml
    ├── ingress.yaml
    ├── hpa.yaml            # Horizontal Pod Autoscaler
    ├── serviceaccount.yaml
    ├── secret.yaml
    ├── configmap.yaml
    ├── pdb.yaml            # Pod Disruption Budget
    ├── NOTES.txt           # Post-install instructions
    └── tests/
        └── test-connection.yaml  # Helm test pod (annotated "helm.sh/hook": test)
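
The deployment template later on this page calls named templates (payment-service.fullname, payment-service.labels, payment-service.selectorLabels, payment-service.serviceAccountName) that live in _helpers.tpl, which is not shown. Below is a minimal sketch modeled on the scaffold that helm create generates; it is an assumption about what the chart's helpers contain, not a copy of them. The labels helper deliberately omits app.kubernetes.io/version because the deployment template adds that label itself, and a duplicate key would make the rendered YAML ambiguous.

```yaml
{{/* templates/_helpers.tpl (sketch based on the helm create scaffold) */}}

{{/* Chart name, truncated to 63 characters (the DNS label limit) */}}
{{- define "payment-service.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/* Release-qualified name, unless the release name already contains the chart name */}}
{{- define "payment-service.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}

{{/* Selector labels: keep stable, because a Deployment's selector is immutable */}}
{{- define "payment-service.selectorLabels" -}}
app.kubernetes.io/name: {{ include "payment-service.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/* Common labels for every resource (version is added by the templates themselves) */}}
{{- define "payment-service.labels" -}}
helm.sh/chart: {{ printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{ include "payment-service.selectorLabels" . }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}

{{/* Service account name: generated from fullname when serviceAccount.create is true */}}
{{- define "payment-service.serviceAccountName" -}}
{{- if .Values.serviceAccount.create }}
{{- default (include "payment-service.fullname" .) .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}
```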

Complete Helm Chart Example

# payment-service-chart/Chart.yaml
apiVersion: v2
name: payment-service
description: Payment service Helm chart
type: application
version: 1.2.3
appVersion: "1.2.3"
kubeVersion: ">=1.27.0-0"  # "-0" lets provider builds such as v1.27.x-eks-... (semver pre-releases) satisfy the constraint
keywords:
  - payment
  - backend
maintainers:
  - name: Platform Team
    email: platform@example.com
dependencies:
  - name: redis
    version: "~> 18.0"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled

# payment-service-chart/values.yaml
# ── Global ───────────────────────────────────────────────
nameOverride: ""
fullnameOverride: ""

# ── Image ────────────────────────────────────────────────
image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/payment-service
  pullPolicy: Always
  tag: ""  # Defaults to Chart appVersion

imagePullSecrets:
  - name: ecr-registry-secret

# ── Replicas ─────────────────────────────────────────────
replicaCount: 2

# ── Deployment Strategy ──────────────────────────────────
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 0

# ── Service Account ──────────────────────────────────────
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payment-service-role
  name: ""

# ── Pod Security Context ─────────────────────────────────
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
  seccompProfile:
    type: RuntimeDefault

# ── Service ──────────────────────────────────────────────
service:
  type: ClusterIP
  port: 80
  targetPort: 8080

# ── Ingress ──────────────────────────────────────────────
ingress:
  enabled: true
  className: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: ""
    alb.ingress.kubernetes.io/healthcheck-path: /health/ready
  hosts:
    - host: api.example.com
      paths:
        - path: /payments
          pathType: Prefix
  tls: []

# ── Resources ────────────────────────────────────────────
resources:
  limits:
    cpu: 1000m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 256Mi

# ── Autoscaling ──────────────────────────────────────────
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

# ── Probes ───────────────────────────────────────────────
livenessProbe:
  httpGet:
    path: /health/live
    port: http
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

# ── Pod Disruption Budget ────────────────────────────────
podDisruptionBudget:
  enabled: true
  minAvailable: 1

# ── Topology Spread ──────────────────────────────────────
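topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:              # Pods matching this selector are counted when computing skew
      matchLabels:
        app.kubernetes.io/name: payment-service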

# ── Node Selector, Tolerations, Affinity ─────────────────
nodeSelector: {}
tolerations: []
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: payment-service
          topologyKey: kubernetes.io/hostname

# ── ConfigMap ────────────────────────────────────────────
configMap:
  data:
    NODE_ENV: "production"
    PORT: "8080"
    LOG_LEVEL: "info"
    REDIS_URL: "redis://payment-service-redis-master:6379"

# ── Secrets ──────────────────────────────────────────────
secrets:
  db_host: "payment-db.cluster-xxx.us-east-1.rds.amazonaws.com"
  # db_password is injected via external-secrets operator

# ── Redis Dependency ─────────────────────────────────────
redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: false
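
The chart tree lists a values-production.yaml overlay that is not shown. A hypothetical sketch of such a file follows, containing only the keys that differ from values.yaml. It sets no replicaCount, because the deployment template omits replicas whenever autoscaling.enabled is true; the production floor is raised through autoscaling.minReplicas instead.

```yaml
# payment-service-chart/values-production.yaml (hypothetical overrides)
image:
  tag: "1.2.3"                  # Pin the tag explicitly instead of falling back to appVersion

ingress:
  annotations:
    # Maps merge key by key, so the other ALB annotations from values.yaml are kept
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/abcd1234

autoscaling:
  minReplicas: 3                # replicaCount is ignored while autoscaling is enabled
  maxReplicas: 20

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

configMap:
  data:
    LOG_LEVEL: "warn"
```

Helm always starts from the chart's own values.yaml and layers each -f file on top in order, so a release could be deployed with helm upgrade --install payment-service ./payment-service-chart -n production -f ./payment-service-chart/values-production.yaml. Maps are merged recursively, but lists (for example tolerations or ingress hosts) are replaced as a whole, so an override that touches a list must repeat every entry. Running helm template with the same flags renders the manifests locally for review.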

# templates/deployment.yaml
{{- $fullName := include "payment-service.fullname" . -}}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ $fullName }}
  labels:
    {{- include "payment-service.labels" . | nindent 4 }}
    app.kubernetes.io/version: {{ .Values.image.tag | default .Chart.AppVersion | quote }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  strategy:
    {{- toYaml .Values.strategy | nindent 4 }}
  selector:
    matchLabels:
      {{- include "payment-service.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "payment-service.selectorLabels" . | nindent 8 }}
        app.kubernetes.io/version: {{ .Values.image.tag | default .Chart.AppVersion | quote }}
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        checksum/secrets: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        rollme: {{ randAlphaNum 5 | quote }}  # Forces a rollout on every upgrade, even with unchanged config; remove it to roll only when the checksums above change
    spec:
      serviceAccountName: {{ include "payment-service.serviceAccountName" . }}
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.targetPort }}
              protocol: TCP
          envFrom:
            - configMapRef:
                name: {{ $fullName }}-config
            - secretRef:
                name: {{ $fullName }}-secrets
          livenessProbe:
            {{- toYaml .Values.livenessProbe | nindent 12 }}
          readinessProbe:
            {{- toYaml .Values.readinessProbe | nindent 12 }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      {{- with .Values.topologySpreadConstraints }}
      topologySpreadConstraints:
        {{- toYaml . | nindent 8 }}
      {{- end }}
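      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}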
      volumes:
        - name: tmp
          emptyDir: {}
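
values.yaml configures autoscaling and podDisruptionBudget, but only the Deployment template is shown. Below is one possible templates/hpa.yaml and templates/pdb.yaml that consume those values. They reuse the helper names from the deployment template and are a sketch under that assumption, not the chart's actual files.

```yaml
# templates/hpa.yaml (sketch)
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "payment-service.fullname" . }}
  labels:
    {{- include "payment-service.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "payment-service.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    {{- with .Values.autoscaling.targetCPUUtilizationPercentage }}
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ . }}
    {{- end }}
    {{- with .Values.autoscaling.targetMemoryUtilizationPercentage }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ . }}
    {{- end }}
  {{- with .Values.autoscaling.behavior }}
  behavior:
    {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}

# templates/pdb.yaml (sketch)
{{- if .Values.podDisruptionBudget.enabled }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "payment-service.fullname" . }}
  labels:
    {{- include "payment-service.labels" . | nindent 4 }}
spec:
  minAvailable: {{ .Values.podDisruptionBudget.minAvailable }}
  selector:
    matchLabels:
      {{- include "payment-service.selectorLabels" . | nindent 6 }}
{{- end }}
```

Because the Deployment template drops replicas when autoscaling is enabled, the HPA is the sole owner of the replica count, so a helm upgrade does not reset it. With minAvailable: 1 and two or more replicas, kubectl drain evicts these pods one at a time and waits whenever an eviction would violate the budget.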

EKS/AKS/GKE Specific Tips

| Aspect | Amazon EKS | Azure AKS | GCP GKE |
|---|---|---|---|
| Network (CNI) | VPC CNI (aws-node) or Calico | Azure CNI or kubenet | VPC-native (alias IP) networking; GKE Dataplane V2 |
| Load Balancer | ALB (AWS Load Balancer Controller) or NLB | Azure Load Balancer + Application Gateway | GCP Load Balancer (Ingress) |
| Managed Node Groups | Managed node groups (MNG), Fargate, Karpenter | VMSS node pools, Virtual Nodes (ACI) | Node pools (Standard), Autopilot |
| IAM for Pods | IAM Roles for Service Accounts (IRSA) | Azure AD Workload Identity | Workload Identity |
| Storage | EBS CSI, EFS CSI | Azure Disk CSI, Azure Files CSI | GCE PD CSI |
| Secrets | Secrets Manager + External Secrets | Azure Key Vault + CSI driver | Secret Manager + CSI driver |
| Cluster Autoscaling | Karpenter (recommended) or Cluster Autoscaler | Cluster Autoscaler or Virtual Node | Node Auto-provisioning |
| Observability | CloudWatch Container Insights, Amazon Managed Service for Prometheus (AMP) | Azure Monitor Container Insights | Cloud Monitoring (formerly Stackdriver) |
| Add-ons | EKS Add-ons (VPC CNI, CoreDNS, kube-proxy) | AKS Add-ons (AGIC, Azure Policy) | GKE Add-ons (Config Connector, Backup for GKE) |
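
The Secrets row pairs AWS Secrets Manager with External Secrets, and values.yaml notes that db_password is injected by the external-secrets operator. A sketch of that wiring with the External Secrets Operator follows. The Secrets Manager entry payment-service/db and its host and password properties are hypothetical; the SecretStore authenticates through the IRSA-annotated payment-service-sa service account; and the API version (external-secrets.io/v1beta1 here) depends on the installed operator release. The target name matches the Secret the Deployment manifest reads; if a Helm-managed Secret with the same name also exists, only one of them should own it.

```yaml
# Hypothetical External Secrets Operator wiring for payment-service-secrets
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: production
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:                            # Authenticate with the service account's IRSA role
          serviceAccountRef:
            name: payment-service-sa
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payment-service-secrets
  namespace: production
spec:
  refreshInterval: 1h                   # How often to re-read Secrets Manager
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: payment-service-secrets       # Kubernetes Secret the operator creates and owns
    creationPolicy: Owner
  data:
    - secretKey: db_host                # Key in the Kubernetes Secret
      remoteRef:
        key: payment-service/db         # Hypothetical JSON secret in Secrets Manager
        property: host
    - secretKey: db_password
      remoteRef:
        key: payment-service/db
        property: password
```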

RBAC in Kubernetes

# RBAC: Grant a service account read-only access to pods in a namespace
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-reader-sa
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: pod-reader-sa
    namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
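
A RoleBinding can also reference a ClusterRole, in which case it grants that ClusterRole's permissions only inside the binding's namespace. This lets a namespace reuse the built-in view ClusterRole instead of maintaining its own Role. The binding name below is hypothetical; note that view deliberately excludes Secrets.

```yaml
# Reuse the built-in "view" ClusterRole, scoped to the production namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-view          # Hypothetical name
  namespace: production          # Permissions apply only in this namespace
subjects:
  - kind: ServiceAccount
    name: pod-reader-sa
    namespace: production
roleRef:
  kind: ClusterRole              # A RoleBinding may point at a ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io
```

Either binding can be verified by impersonating the service account: kubectl auth can-i list pods -n production --as=system:serviceaccount:production:pod-reader-sa should answer yes, and the same check against secrets should answer no.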

Resource Limits and HPA

# HPA manifest
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
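    # Pods metrics are served by the custom metrics API (custom.metrics.k8s.io) and need an adapter such as Prometheus Adapter
    - type: Pods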
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
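
The HPA computes a replica proposal for each metric and acts on the largest one. Resource Utilization targets are percentages of the container requests, so averageUtilization: 70 against the 250m CPU request means roughly 175m per pod. A worked example with hypothetical readings (4 current replicas at 90% average CPU and 60% average memory, ignoring the custom metric for simplicity):

```latex
\begin{aligned}
\text{desiredReplicas} &= \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil \\
\text{CPU proposal} &= \left\lceil 4 \times \tfrac{90}{70} \right\rceil = \lceil 5.14 \rceil = 6 \\
\text{memory proposal} &= \left\lceil 4 \times \tfrac{60}{80} \right\rceil = \lceil 3 \rceil = 3 \\
\text{desiredReplicas} &= \max(6,\ 3) = 6
\end{aligned}
```

Going from 4 to 6 adds 2 pods, within the scaleUp policy of 4 pods per 60 seconds, and 6 lies between minReplicas (2) and maxReplicas (20), so the HPA would apply it directly. The controller also skips a change when the metric ratio is within its default tolerance of 10% of 1.0.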

Debugging Commands

# ── Pod Debugging ────────────────────────────────────────
# Check pod status and events
kubectl describe pod <pod-name> -n <namespace>

# View logs (current container)
kubectl logs <pod-name> -n <namespace>

# View logs (previous container after crash)
kubectl logs <pod-name> -n <namespace> --previous

# Follow logs with timestamps
kubectl logs -f <pod-name> -n <namespace> --timestamps

# Multi-container pod: specify container
kubectl logs <pod-name> -c <container-name> -n <namespace>

# Stream all pods matching a label (-f follows at most 5 log streams by default; raise --max-log-requests for more)
kubectl logs -f -l app=payment-service -n <namespace> --all-containers

# ── Interactive Debugging ────────────────────────────────
# Shell into running pod
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

# Run a debug container alongside (ephemeral container)
kubectl debug -it <pod-name> -n <namespace> --image=nicolaka/netshoot --target=<container>

# Copy files to/from a pod (requires the tar binary in the container image)
kubectl cp ./local-file <pod-name>:/tmp/remote-file -n <namespace>
kubectl cp <pod-name>:/tmp/logs ./local-logs -n <namespace>

# ── Node Debugging ───────────────────────────────────────
# Check node resource usage
kubectl top node

# Check node status and conditions
kubectl describe node <node-name>

# Check node events
kubectl get events --field-selector involvedObject.kind=Node

# Cordon a node to stop new pods being scheduled on it; uncordon to allow scheduling again
kubectl cordon <node-name>
kubectl uncordon <node-name>

# Drain a node (evict pods gracefully)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# ── Network Debugging ────────────────────────────────────
# Test HTTP connectivity to a Service from a temporary pod
kubectl run debug --rm -i --restart=Never --image=busybox -- wget -qO- http://<service-name>.<namespace>.svc.cluster.local

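# Check DNS resolution (busybox:1.28 is commonly used because nslookup in some newer busybox images misreports cluster DNS lookups)
kubectl run debug --rm -i --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default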

# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# ── Event Analysis ───────────────────────────────────────
# Watch all events in real-time
kubectl get events -w --sort-by=.lastTimestamp

# Filter events by type
kubectl get events --field-selector type=Warning

# Filter events by object
kubectl get events --field-selector involvedObject.name=<pod-name>
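
The kubectl run commands above create a throwaway pod for each check. For a longer network investigation it can be simpler to keep one toolbox pod running and exec into it. The sketch below uses the nicolaka/netshoot image already used with kubectl debug; the pod name, label, resource sizes, and one-hour lifetime are assumptions. Namespaces that enforce the restricted Pod Security Standard may reject it unless a suitable securityContext is added.

```yaml
# Hypothetical toolbox pod for network debugging; delete it when finished
apiVersion: v1
kind: Pod
metadata:
  name: netshoot-debug
  namespace: production
  labels:
    purpose: debug
spec:
  restartPolicy: Never
  containers:
    - name: netshoot
      image: nicolaka/netshoot
      command: ["sleep", "3600"]   # Keep the pod alive for one hour, then it completes
      resources:
        requests:
          cpu: 50m
          memory: 64Mi
        limits:
          memory: 128Mi
```

After kubectl apply -f, tools can be run inside it, for example kubectl exec -it netshoot-debug -n production -- dig payment-service.production.svc.cluster.local, and the pod removed afterwards with kubectl delete pod netshoot-debug -n production.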