DevOps Learning Journey: From Concepts to Production

A beginner's guide to understanding DevOps, Infrastructure as Code, and lessons learned from building a production-ready Kubernetes infrastructure.


What is DevOps?

The Traditional Problem

In traditional software development:

  • Developers write code on their laptops
  • Operations deploy and maintain servers
  • These teams work in silos (separately)
  • Deployments are manual, slow, and error-prone
  • "It works on my machine" becomes a common problem

The DevOps Solution

DevOps combines Development + Operations into a unified workflow:

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  Developer  │────▶│  Automation  │────▶│ Production  │
│   writes    │     │   Pipeline   │     │   Server    │
│    code     │     │              │     │             │
└─────────────┘     └──────────────┘     └─────────────┘
                           │
                           │ Tests, Builds,
                           │ Deploys automatically
                           │

Key Principles:

  • Automation: Automate everything (testing, building, deploying)
  • Version Control: Store everything in Git (code, infrastructure, configs)
  • Continuous Integration/Deployment (CI/CD): Deploy frequently and safely
  • Monitoring: Know what's happening in production
  • Collaboration: Developers and operations work together

Why DevOps Matters

Without DevOps                              With DevOps
──────────────────────────────────────      ──────────────────────────────────────
Manual deployments taking hours/days        Automated deployments in minutes
Configuration drift (servers differ)        Consistent, reproducible infrastructure
"Works on my machine" problems              Identical dev/staging/prod environments
Fear of deploying (might break things)      Confidence through automation and testing
Slow feedback loops                         Fast feedback and rapid iteration

Core DevOps Concepts

1. Infrastructure as Code (IaC)

Traditional Approach:

1. Log into AWS console
2. Click "Create EC2 instance"
3. Choose settings manually
4. Repeat for each server
5. Hope you remember what you did

IaC Approach:

# infrastructure.tf
resource "aws_instance" "web_server" {
  ami           = "ami-12345"
  instance_type = "t3.small"

  tags = {
    Name = "web-server-1"
  }
}

Run terraform apply and it creates the server. Same code = same infrastructure every time.

2. Immutable Infrastructure

Old Way (Mutable):

  • Create a server
  • SSH in and install software
  • Update it over time
  • Each server becomes unique (snowflake)

New Way (Immutable):

  • Define server configuration in code
  • Build server image
  • Deploy new servers, delete old ones
  • Every server is identical and replaceable

3. Configuration Management

Tools like:

  • Terraform: Creates infrastructure (servers, networks, databases)
  • Ansible/Chef/Puppet: Configures servers after creation
  • Docker: Packages applications with dependencies
  • Kubernetes: Orchestrates containers at scale

4. CI/CD Pipeline

Developer commits code to Git
         ↓
    Automated tests run
         ↓
    Build Docker image
         ↓
    Deploy to staging
         ↓
    Integration tests
         ↓
    Deploy to production
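
CI systems differ, but the stages above boil down to shell steps like these. This is a minimal sketch: the registry, image name, $SHA variable, and make targets are placeholders, not part of this project:

#!/usr/bin/env bash
set -euo pipefail   # fail the pipeline on the first error

make test                                          # run automated tests
docker build -t registry.example.com/app:"$SHA" .  # build the image
docker push registry.example.com/app:"$SHA"

# Deploy to staging, verify, then promote to production
kubectl --context staging set image deployment/app app=registry.example.com/app:"$SHA"
kubectl --context staging rollout status deployment/app
make integration-test

kubectl --context prod set image deployment/app app=registry.example.com/app:"$SHA"
kubectl --context prod rollout status deployment/app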

Infrastructure as Code (IaC)

What is IaC?

Infrastructure as Code means writing code to manage and provision infrastructure instead of manual processes.

Benefits of IaC

Collaboration: Team can review infrastructure changes

# Pull request workflow
git checkout -b add-monitoring
# Make changes
git push
# Create PR for team review

Testing: Test infrastructure before deploying

terraform plan  # Preview changes before applying

Documentation: Code IS the documentation

# This clearly shows we have 3 control plane nodes
count = 3

Reproducibility: Create identical environments

terraform apply  # Creates exact same infrastructure

Version Control: Track every infrastructure change in Git

git log infrastructure.tf
# See who changed what and when

IaC Tools Comparison

Tool             Purpose                        Declarative/Imperative    Best For
───────────────  ─────────────────────────────  ────────────────────────  ──────────────────────────
Terraform        Infrastructure provisioning    Declarative               Multi-cloud infrastructure
CloudFormation   AWS infrastructure             Declarative               AWS-only projects
Ansible          Configuration management       Both                      Server configuration
Pulumi           Infrastructure provisioning    Imperative (real code)    Developers who prefer code

Declarative vs Imperative:

Imperative (How): Tell the computer HOW to do something

# Step by step instructions
create_vpc()
create_subnet()
create_instance()

Declarative (What): Tell the computer WHAT you want

# Describe desired state
resource "aws_vpc" "main" { ... }
resource "aws_subnet" "private" { ... }
resource "aws_instance" "app" { ... }

Terraform figures out HOW to create it.


Introduction to Terraform

What is Terraform?

Terraform is an open-source tool for building, changing, and versioning infrastructure safely and efficiently.

Created by HashiCorp, it works with:

  • AWS, Azure, Google Cloud
  • Kubernetes, Docker
  • GitHub, Datadog
  • 3,000+ providers

Core Terraform Concepts

1. Providers

Providers are plugins that interact with cloud platforms:

provider "aws" {
  region = "us-east-1"
}

provider "kubernetes" {
  host = "https://k8s-cluster.example.com"
}

2. Resources

Resources are infrastructure components:

# A resource has:
# - Type: aws_instance
# - Local name: web_server
# - Configuration: everything inside { }

resource "aws_instance" "web_server" {
  ami           = "ami-12345"
  instance_type = "t3.small"

  tags = {
    Name = "My Web Server"
  }
}

3. Variables

Variables make code reusable:

# variables.tf
variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

# main.tf
resource "aws_instance" "server" {
  instance_type = var.environment == "prod" ? "t3.large" : "t3.small"

  tags = {
    Environment = var.environment
  }
}
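
Values can also be overridden at apply time, so the same code serves every environment (the tfvars filename here is just an example):

terraform apply -var="environment=prod"

# Or keep per-environment values in a file
terraform apply -var-file="prod.tfvars"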

4. Outputs

Outputs extract information:

output "server_ip" {
  value = aws_instance.web_server.public_ip
}

# After terraform apply:
# server_ip = "54.123.45.67"
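
Outputs can be read back at any time after apply, which is handy in scripts:

terraform output             # list all outputs
terraform output server_ip   # print a single value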

5. State

Terraform stores the current state of infrastructure:

# backend.tf
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "infrastructure/terraform.tfstate"
    region = "us-east-1"
  }
}

Why state matters:

  • Knows what infrastructure exists
  • Maps resources to real-world objects
  • Tracks metadata and dependencies
  • Enables collaboration (shared state)
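
You can inspect what the state currently tracks; the resource address below matches the earlier web_server example:

terraform state list                          # every resource in the state
terraform state show aws_instance.web_server  # recorded attributes of one resource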

Terraform Workflow

# 1. Write configuration
vim main.tf

# 2. Initialize (download providers)
terraform init

# 3. Preview changes
terraform plan

# 4. Apply changes
terraform apply

# 5. View state
terraform show

# 6. Destroy (cleanup)
terraform destroy

Example: Creating an AWS Server

# main.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Create VPC
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "main-vpc"
  }
}

# Create subnet
resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.main.id  # Reference to VPC above
  cidr_block = "10.0.1.0/24"

  tags = {
    Name = "public-subnet"
  }
}

# Create server
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.public.id  # Reference to subnet

  tags = {
    Name = "web-server"
  }
}

# Output the server's IP
output "server_ip" {
  value = aws_instance.web.public_ip
}

Run it:

terraform init
terraform plan
terraform apply
# server_ip = "54.123.45.67"

Terraform Modules

Modules are reusable Terraform code:

modules/
  └── vpc/
      ├── main.tf      # VPC resources
      ├── variables.tf # Inputs
      └── outputs.tf   # Outputs

environments/
  ├── dev/
  │   └── main.tf     # Uses vpc module
  ├── staging/
  │   └── main.tf     # Uses vpc module
  └── prod/
      └── main.tf     # Uses vpc module

Using a module:

module "vpc" {
  source = "../../modules/vpc"

  cidr_block  = "10.0.0.0/16"
  environment = "dev"
}

# Access module outputs
output "vpc_id" {
  value = module.vpc.vpc_id
}

Benefits:

  • DRY (Don't Repeat Yourself)
  • Tested, reusable components
  • Easier to maintain
  • Consistent across environments

Understanding Kubernetes

What is Kubernetes?

Kubernetes (K8s) is a container orchestration platform that automates deployment, scaling, and management of containerized applications.

Think of it as an operating system for your cluster:

  • Without K8s: You manually start containers on servers
  • With K8s: You declare what you want, K8s makes it happen

Why Kubernetes?

Problem without K8s:

Server 1: Running app-v1, app-v1, app-v2 (manual deployment)
Server 2: Running app-v1, crashed (needs manual restart)
Server 3: Idle (wasting money)

Solution with K8s:

3 Servers (cluster)
K8s automatically:
- Distributes containers evenly
- Restarts crashed containers
- Scales up/down based on load
- Updates apps with zero downtime

Kubernetes Architecture

┌─────────────────────────────────────────────────────┐
│              Kubernetes Cluster                     │
│                                                     │
│  ┌───────────────────────────────────────────────┐  │
│  │         Control Plane (Master)                │  │
│  │  ┌──────────┐  ┌───────┐  ┌──────────────┐    │  │
│  │  │ API      │  │ etcd  │  │ Scheduler    │    │  │
│  │  │ Server   │  │       │  │ Controller   │    │  │
│  │  └──────────┘  └───────┘  └──────────────┘    │  │
│  └───────────────────────────────────────────────┘  │
│                                                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐           │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │           │
│  │          │  │          │  │          │           │
│  │  Pods    │  │  Pods    │  │  Pods    │           │
│  │  [App]   │  │  [App]   │  │  [App]   │           │
│  └──────────┘  └──────────┘  └──────────┘           │
└─────────────────────────────────────────────────────┘

Control Plane Components:

  • API Server: Front-end for K8s (you talk to this)
  • etcd: Database storing cluster state
  • Scheduler: Decides which worker runs which pod
  • Controller Manager: Maintains desired state

Worker Node Components:

  • kubelet: Agent running on each node
  • Container Runtime: Runs containers (Docker, containerd)
  • kube-proxy: Network proxy for services

Key Kubernetes Concepts

1. Pods

Pod = Smallest deployable unit (1+ containers)

apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  containers:
  - name: nginx
    image: nginx:1.25
    ports:
    - containerPort: 80
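
Assuming the manifest above is saved as nginx-pod.yaml, a minimal round trip looks like this:

kubectl apply -f nginx-pod.yaml    # create the pod
kubectl get pod nginx-pod          # wait for STATUS=Running
kubectl logs nginx-pod             # view the nginx logs
kubectl delete pod nginx-pod       # clean up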

2. Deployments

Deployment = Manages multiple pods (replicas)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3  # Run 3 copies
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80

What it does:

  • Maintains 3 nginx pods
  • If one crashes, creates a new one
  • Rolling updates (update without downtime)
  • Rollback if update fails
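
Each of these behaviors has a kubectl counterpart you can try against the Deployment above (the 1.26 tag is just an example):

kubectl set image deployment/nginx-deployment nginx=nginx:1.26  # trigger a rolling update
kubectl rollout status deployment/nginx-deployment              # watch it progress
kubectl rollout history deployment/nginx-deployment             # list revisions
kubectl rollout undo deployment/nginx-deployment                # roll back if it fails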

3. Services

Service = Stable network endpoint for pods

apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
  type: LoadBalancer  # Creates AWS load balancer

Service Types:

  • ClusterIP: Internal only (default)
  • NodePort: Accessible on node IP:port
  • LoadBalancer: Creates cloud load balancer
  • ExternalName: DNS alias
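
For quick testing you don't need a LoadBalancer at all; port-forwarding through the API server works with any Service type:

kubectl get svc nginx-service                   # inspect the ClusterIP and ports
kubectl port-forward svc/nginx-service 8080:80  # forward local port 8080 to the service
curl http://localhost:8080                      # traffic reaches the nginx pods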

4. Namespaces

Namespace = Virtual cluster (logical separation)

# System namespaces
kube-system     # K8s components
kube-public     # Public resources
default         # Default namespace

# Custom namespaces
dev             # Development apps
staging         # Staging apps
prod            # Production apps

Creating a namespace declaratively:

apiVersion: v1
kind: Namespace
metadata:
  name: production

Deploying into a namespace:

kubectl apply -f app.yaml -n production
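
Namespaces can also be created and queried imperatively:

kubectl create namespace dev       # imperative alternative to the manifest above
kubectl get namespaces             # list all namespaces
kubectl get pods -n kube-system    # scope a query to one namespace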

5. ConfigMaps and Secrets

ConfigMap = Configuration data

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  database_url: "postgres://db.example.com:5432"
  log_level: "info"

Secret = Sensitive data (base64 encoded)

apiVersion: v1
kind: Secret
metadata:
  name: db-secret
type: Opaque
data:
  password: cGFzc3dvcmQxMjM=  # base64 encoded
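
Rather than base64-encoding values by hand, kubectl can create the Secret for you (the password is a placeholder):

kubectl create secret generic db-secret --from-literal=password='password123'

# Decode a stored value to verify it
kubectl get secret db-secret -o jsonpath='{.data.password}' | base64 -d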

Using in pod:

spec:
  containers:
  - name: app
    image: myapp:1.0
    env:
    - name: DATABASE_URL
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: database_url
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-secret
          key: password

Self-Managed vs Managed Kubernetes

Aspect           Self-Managed (kubeadm)    Managed (EKS/GKE/AKS)
───────────────  ────────────────────────  ─────────────────────────
Setup            Manual (complex)          Automated (easy)
Control Plane    You manage                Cloud manages
Upgrades         Manual                    Automated options
Cost             EC2 costs only            EC2 + control plane fee
Customization    Full control              Some limitations
Responsibility   Everything                Just worker nodes
Best For         Learning, custom needs    Production, simplicity

Our infrastructure uses self-managed Kubernetes (kubeadm) for:

  • Full control and learning
  • No control plane costs
  • Understanding how K8s works internally

GitOps: The Modern Way

What is GitOps?

GitOps = Using Git as the single source of truth for infrastructure and applications.

Traditional Deployment:
Developer → kubectl apply → Kubernetes

GitOps Deployment:
Developer → Git commit → ArgoCD → Kubernetes

GitOps Principles

  1. Declarative: Desired state declared in Git
  2. Versioned: All changes tracked in Git
  3. Pulled: Agents pull changes (don't push)
  4. Continuously Reconciled: Auto-sync to match Git

GitOps with ArgoCD

ArgoCD is a GitOps tool for Kubernetes:

┌──────────┐      ┌──────────┐      ┌─────────────┐
│   Git    │◄─────│  ArgoCD  │─────▶│ Kubernetes  │
│  Repo    │ Pull │  Syncs   │Apply │   Cluster   │
└──────────┘      └──────────┘      └─────────────┘
     │                                      │
     │  Developer commits                   │
     │  deployment.yaml                     │
     │                                      │
     └──────────────────────────────────────┘
           ArgoCD detects change
           and applies to cluster

Workflow:

Developer makes change:

# Update image version
vim kubernetes/app-deployment.yaml
git commit -m "Update app to v1.2.0"
git push

ArgoCD detects change:

ArgoCD polls Git every 3 minutes
Sees new commit
Compares Git state vs Cluster state

ArgoCD syncs:

Applies changes to Kubernetes
Monitors rollout
Reports status back

Self-healing:

If someone manually changes K8s
ArgoCD detects drift
Reverts to Git state

ArgoCD Application

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  # Where is the source?
  source:
    repoURL: https://github.com/myorg/my-repo
    path: kubernetes/overlays/production
    targetRevision: main

  # Where to deploy?
  destination:
    server: https://kubernetes.default.svc
    namespace: production

  # How to sync?
  syncPolicy:
    automated:
      prune: true      # Delete resources not in Git
      selfHeal: true   # Revert manual changes
    syncOptions:
    - CreateNamespace=true
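
Assuming the manifest above is saved as application.yaml, registering and checking it looks like this (the argocd CLI steps assume you are already logged in):

kubectl apply -f application.yaml    # register the Application with ArgoCD
kubectl get applications -n argocd   # check sync and health status

# Or with the argocd CLI
argocd app get my-app
argocd app sync my-app               # trigger a manual sync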

Benefits of GitOps

Security: No kubectl access needed

Developers → Git (with PR review)
ArgoCD → Kubernetes (with K8s credentials)

Developers never need cluster access

Developer Friendly: Use familiar Git workflow

# Same as code development
git checkout -b feature
# Make changes
git commit && git push
# Create pull request
# After approval, auto-deploys

Disaster Recovery: Recreate everything from Git

# Lost cluster? Just point ArgoCD to Git
# Everything recreates automatically

Easy Rollback: Revert Git commit

git revert HEAD
git push
# ArgoCD automatically rolls back

Audit Trail: Every change tracked in Git

git log kubernetes/
# See who deployed what and when

Real-World Lessons Learned

Lesson 1: The Missing IAM Permissions

What Happened

Worker nodes never joined the cluster, and every ArgoCD pod was stuck in Pending for 4+ hours.

kubectl get pods -n argocd
# All pods: STATUS=Pending for 4+ hours

Root Cause

The infrastructure uses a clever design:

  1. Control plane node creates a kubeadm join command
  2. Stores it in AWS Systems Manager (SSM) Parameter Store
  3. Worker nodes retrieve the command from SSM
  4. Use it to join the cluster

The problem: the IAM role had only the AmazonSSMManagedInstanceCore policy, which covers the SSM agent and Session Manager (interactive shell access), not SSM Parameter Store operations such as storing and retrieving parameters.

Control Plane tried to:
aws ssm put-parameter --name "/cluster/join-command" ...
❌ AccessDenied: Not authorized to perform ssm:PutParameter

Worker Nodes waiting for parameter that was never created...
⏰ Waiting... 4 hours... still waiting...

The Fix

Added SSM Parameter Store permissions to IAM role:

statement {
  sid = "SSMParameterStore"
  actions = [
    "ssm:GetParameter",
    "ssm:PutParameter",
    "ssm:DeleteParameter"
  ]
  resources = [
    "arn:aws:ssm:*:*:parameter/gitops-infra-*"
  ]
}

Lessons Learned

  1. IAM Policies Are Specific: AmazonSSMManagedInstanceCore ≠ Parameter Store access
    • Read policy documentation carefully
    • Test what the policy actually allows
  2. Understand Dependencies: Worker nodes depend on the control plane's SSM parameter
    • Document dependencies in code comments
    • Test the full workflow

Validate IAM Before Launch:

# Test IAM permissions before deploying
aws ssm put-parameter --name "/test" --value "test" --type "String"

Check Logs Early:

# Don't wait 4 hours!
# Check logs immediately when pods are pending
kubectl describe pod <pod-name>
ssh to-node
sudo journalctl -u cloud-final | grep error

Lesson 2: Control Plane Taints

What Happened

The cluster had only one node (the control plane), but pods couldn't be scheduled on it.

kubectl get nodes
# NAME                  STATUS   ROLES           AGE
# control-plane-node    Ready    control-plane   5h

kubectl get pods -n argocd
# All pods: Pending

kubectl describe pod argocd-server-xxx
# Events: 0/1 nodes are available: 1 node(s) had untolerated taint

Root Cause

Kubernetes taints the control-plane node by default:

node-role.kubernetes.io/control-plane: NoSchedule

Why? Security and stability:

  • Control plane runs critical K8s components
  • User workloads might consume resources
  • Separate control plane from workloads

The problem: no worker nodes existed, only the control plane.

The Fix

Option 1: Remove taint (dev/testing only)

kubectl taint nodes --all node-role.kubernetes.io/control-plane-

Option 2: Add worker nodes (proper solution)

# Fix IAM + reboot workers
aws ec2 reboot-instances --instance-ids i-xxx i-yyy

Lessons Learned

  1. Cluster Size Matters:
    • 1 node: Remove taint (development only)
    • 3+ nodes: Keep taint (production practice)

Check Node Availability First:

kubectl get nodes
kubectl describe nodes
# Look for taints and allocatable resources

Understand Taints and Tolerations:

# Taint = "I don't want pods"
node-role.kubernetes.io/control-plane: NoSchedule

# Toleration = "I can handle that taint"
tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule

Lesson 3: Server-Side Apply for Large CRDs

What Happened

Installing ArgoCD failed:

kubectl apply -f install.yaml
# Error: The CustomResourceDefinition "applicationsets.argoproj.io" is invalid:
# metadata.annotations: Too long: must have at most 262144 bytes

Root Cause

Client-side apply (kubectl apply):

  • Stores entire manifest in annotation kubectl.kubernetes.io/last-applied-configuration
  • Used for three-way merge on updates
  • ArgoCD CRDs are huge (>262KB)
  • Exceeds Kubernetes annotation limit

Server-side apply (kubectl apply --server-side):

  • Server tracks changes (no annotation needed)
  • No size limit
  • Modern approach (beta since Kubernetes 1.16, GA in 1.22)

The Fix

# Old way (fails)
kubectl apply -f argocd-install.yaml

# New way (works)
kubectl apply --server-side=true --force-conflicts -f argocd-install.yaml

Lessons Learned

  1. Use Server-Side Apply for Large Resources:
    • CRDs, especially from complex tools (ArgoCD, Istio, etc.)
    • Avoids annotation size limits
    • Better conflict detection

Handle Conflicts Gracefully:

# If switching from client-side to server-side
--force-conflicts  # Take ownership of fields

Understand Apply Mechanisms:

Client-side apply:
- Three-way merge (current, desired, last-applied)
- Last-applied stored in annotation
- Legacy approach

Server-side apply:
- Server tracks field ownership
- No annotation needed
- Modern, recommended
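
You can see this ownership tracking on any object; managed fields are hidden by default but can be shown:

# Inspect server-side field ownership (output is verbose)
kubectl get crd applicationsets.argoproj.io -o yaml --show-managed-fields | less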

Lesson 4: EC2 Instances Running ≠ Nodes Ready

What Happened

aws ec2 describe-instances
# i-xxx: running ✓
# i-yyy: running ✓

kubectl get nodes
# Only control-plane node (no workers)

EC2 shows instances running, but Kubernetes doesn't see them.

Root Cause

EC2 instance "running" just means:

  • VM is powered on
  • Operating system booted

It does NOT mean:

  • Software installed correctly
  • Node joined Kubernetes cluster
  • Ready to run pods

The problem: kubeadm join command never ran successfully (due to missing SSM parameter).

Lessons Learned

Validate from Multiple Perspectives:

# AWS perspective
aws ec2 describe-instances

# Kubernetes perspective
kubectl get nodes

# Application perspective
kubectl get pods

Check User-Data Logs:

ssh to-instance
sudo cat /var/log/cloud-init-output.log
sudo journalctl -u cloud-final

EC2 Status vs Application Status:

EC2 Instance: running ✓
├─ OS Booted: yes ✓
├─ User-data script: failed ✗
└─ Application: not working ✗

Lesson 5: Bootstrap Order Matters

What Happened

Worker nodes tried to join before control plane was ready, or control plane couldn't store join command due to missing permissions.

Root Cause

Dependency chain:

1. IAM role created
   ↓
2. EC2 instances launch
   ↓
3. Control plane initializes
   ↓
4. Control plane stores join command (needs IAM)
   ↓
5. Worker nodes retrieve join command (needs IAM)
   ↓
6. Worker nodes join cluster

If step 4 or 5 fails, everything breaks.

Lessons Learned

Dependencies in Code:

resource "aws_instance" "worker" {
  # ...
  depends_on = [
    aws_ssm_parameter.join_command  # Explicit dependency
  ]
}

Add Retry Logic:

# In worker.sh
for i in {1..30}; do
  JOIN_COMMAND=$(aws ssm get-parameter ...) && break
  echo "Waiting for join command... ($i/30)"
  sleep 10
done

IAM Propagation Delay:

IAM changes aren't instant
Wait 10-30 seconds after creating role
Before launching instances that use it
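
One way to make that wait explicit is to poll for the instance profile before launching anything that uses it; a minimal sketch using this project's profile name:

# Wait until the instance profile is visible to the IAM API
for i in {1..10}; do
  aws iam get-instance-profile \
    --instance-profile-name gitops-infra-k8s-node-profile >/dev/null 2>&1 && break
  echo "Waiting for IAM propagation... ($i/10)"
  sleep 5
done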

Bootstrap Order Matters:

# Correct order:
1. terraform apply (bootstrap/iam)      # IAM roles
2. Wait for IAM propagation             # 10-30 seconds
3. terraform apply (environments/dev)    # Infrastructure
4. Verify control plane ready           # Check logs
5. Verify workers joined                # kubectl get nodes

How Everything Works Together

The Complete Infrastructure Flow

Let's trace how everything connects in our infrastructure:

┌─────────────────────────────────────────────────────────┐
│                    1. Developer                          │
│                                                          │
│  git commit -m "Add new feature"                         │
│  git push origin main                                    │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                 2. GitHub Actions                        │
│                                                          │
│  - Runs on: push to main                                 │
│  - Authenticates: OIDC (no static keys!)                 │
│  - Validates: terraform plan                             │
│  - Applies: terraform apply (if approved)                │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                  3. Terraform                            │
│                                                          │
│  - Reads: .tf files from Git                             │
│  - Plans: Calculate changes needed                       │
│  - Applies: Create/update AWS resources                  │
│  - State: Stored in S3 (gitops-infra-terraform-state)    │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                    4. AWS                                │
│                                                          │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────┐  │
│  │    VPC      │  │  EC2 Control │  │  EC2 Workers   │  │
│  │  Subnets    │  │    Plane     │  │   (2 nodes)    │  │
│  │  NAT/IGW    │  │  (1 node)    │  │                │  │
│  └─────────────┘  └──────┬───────┘  └───────┬────────┘  │
│                          │                   │           │
│                          │ Runs kubeadm init │           │
│                          │                   │           │
│                          ▼                   │           │
│                  ┌────────────────┐          │           │
│                  │  SSM Parameter │◄─────────┘           │
│                  │  /cluster/join │  Workers fetch       │
│                  └────────────────┘  join command        │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│              5. Kubernetes Cluster                       │
│                                                          │
│  ┌──────────────────────────────────────────────────┐   │
│  │  Control Plane (Master)                          │   │
│  │  - API Server (port 6443)                        │   │
│  │  - etcd (cluster state)                          │   │
│  │  - Scheduler (assigns pods to nodes)             │   │
│  │  - Controller Manager (maintains state)          │   │
│  └──────────────────────────────────────────────────┘   │
│                                                          │
│  ┌────────────┐              ┌────────────┐             │
│  │  Worker 1  │              │  Worker 2  │             │
│  │            │              │            │             │
│  │  kubelet   │              │  kubelet   │             │
│  │  Pods:     │              │  Pods:     │             │
│  │  - ArgoCD  │              │  - Apps    │             │
│  └────────────┘              └────────────┘             │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                  6. ArgoCD                               │
│                                                          │
│  - Installed in: argocd namespace                        │
│  - Watches: GitHub repo (infrastructure)                 │
│  - Syncs: Every 3 minutes                                │
│  - Applies: Kubernetes manifests from Git                │
│                                                          │
│  Applications managed:                                   │
│  ├── ingress-nginx (core)                                │
│  ├── cert-manager (core)                                 │
│  ├── external-secrets (core)                             │
│  └── sample-app (application)                            │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│            7. Application Deployment                     │
│                                                          │
│  Developer commits: kubernetes/app-deployment.yaml       │
│           ↓                                              │
│  ArgoCD detects change in Git                            │
│           ↓                                              │
│  ArgoCD applies to Kubernetes                            │
│           ↓                                              │
│  Deployment creates Pods                                 │
│           ↓                                              │
│  Service exposes Pods                                    │
│           ↓                                              │
│  Ingress routes traffic (HTTPS with cert-manager)        │
│           ↓                                              │
│  Application running ✓                                   │
└─────────────────────────────────────────────────────────┘

Component Interaction Example

Let's trace a real deployment:

Scenario: Deploy new version of application

Day 1, 9:00 AM - Developer makes change:
─────────────────────────────────────────
vim helm/values/dev/values.yaml
# Change: image.tag: "1.1.0" → "1.2.0"

git commit -m "Update app to v1.2.0"
git push origin main

├─→ GitHub receives push
    ├─→ Triggers GitHub Actions workflow
        ├─→ Runs terraform plan (if Terraform changed)
        └─→ No Terraform changes, skips apply

Day 1, 9:03 AM - ArgoCD detects change:
────────────────────────────────────────
ArgoCD polls GitHub every 3 minutes
├─→ Fetches latest commit
├─→ Compares Git state vs Cluster state
└─→ Detects difference: image tag changed

├─→ ArgoCD syncs application
    ├─→ kubectl apply -f deployment.yaml
    ├─→ Kubernetes creates new ReplicaSet
    │   └─→ New ReplicaSet: image:1.2.0
    │       Old ReplicaSet: image:1.1.0
    │
    ├─→ Rolling update starts:
    │   ├─→ Create 1 pod with new version
    │   ├─→ Wait for pod ready
    │   ├─→ Terminate 1 pod with old version
    │   └─→ Repeat until all pods updated
    │
    └─→ ArgoCD monitors health
        ├─→ Checks pod status
        ├─→ Checks readiness probes
        └─→ Reports: Healthy ✓

Day 1, 9:05 AM - Deployment complete:
──────────────────────────────────────
kubectl get pods
# NAME                    READY   STATUS    RESTARTS   AGE
# app-55c8d9f89d-abc12    1/1     Running   0          2m
# app-55c8d9f89d-def34    1/1     Running   0          2m
# app-55c8d9f89d-ghi56    1/1     Running   0          2m

kubectl get replicasets
# NAME              DESIRED   CURRENT   READY   AGE
# app-55c8d9f89d    3         3         3       2m    (new)
# app-64b7c8d78e    0         0         0       5d    (old)

ArgoCD UI shows:
├─→ Status: Synced + Healthy
├─→ Last Sync: 2 minutes ago
└─→ Git Commit: abc123 "Update app to v1.2.0"

The Security Flow

How authentication and authorization work:

┌─────────────────────────────────────────────────────────┐
│                  Security Layers                         │
└─────────────────────────────────────────────────────────┘

1. GitHub Actions → AWS
─────────────────────────
GitHub Actions workflow runs
├─→ Uses: aws-actions/configure-aws-credentials
├─→ Method: OIDC (OpenID Connect)
│   ├─→ No static AWS keys in GitHub!
│   ├─→ GitHub provides JWT token
│   └─→ AWS validates token
│
└─→ Assumes IAM role: gitops-infra-github-actions-role
    └─→ Role has: AdministratorAccess (Terraform needs it)

2. EC2 Instances → AWS Services
────────────────────────────────
EC2 instance launches
├─→ Attached: Instance Profile (gitops-infra-k8s-node-profile)
├─→ Profile contains: IAM Role (gitops-infra-k8s-node-role)
│
└─→ Role has permissions:
    ├─→ EC2: Describe instances, attach volumes
    ├─→ ECR: Pull container images
    ├─→ SSM: Session Manager + Parameter Store
    └─→ ELB: Describe load balancers

3. kubectl → Kubernetes API
────────────────────────────
User runs kubectl command
├─→ Reads: ~/.kube/config
│   ├─→ Contains: API server URL
│   └─→ Contains: Client certificate
│
└─→ API Server validates:
    ├─→ Authentication: Is this a valid user?
    │   └─→ Checks client certificate
    │
    └─→ Authorization: What can they do?
        └─→ RBAC (Role-Based Access Control)
            ├─→ Roles: Define permissions
            └─→ RoleBindings: Assign roles to users

4. Pods → AWS Services
───────────────────────
Pod needs to access AWS (e.g., S3, Secrets Manager)
├─→ Option 1: IRSA (IAM Roles for Service Accounts)
│   ├─→ Service Account annotated with IAM role
│   ├─→ Pod uses this Service Account
│   └─→ Automatically gets temporary AWS credentials
│
└─→ Option 2: Node IAM role (less secure)
    └─→ Pod inherits node's IAM role permissions

The State Management Flow

How Terraform tracks infrastructure:

┌─────────────────────────────────────────────────────────┐
│              Terraform State Flow                        │
└─────────────────────────────────────────────────────────┘

1. Initial Setup
────────────────
terraform {
  backend "s3" {
    bucket = "gitops-infra-terraform-state-ACCOUNT_ID"
    key    = "environments/dev/terraform.tfstate"
    region = "eu-west-1"
  }
}

State file structure:
{
  "version": 4,
  "terraform_version": "1.5.0",
  "resources": [
    {
      "type": "aws_instance",
      "name": "control_plane",
      "instances": [{
        "attributes": {
          "id": "i-0abc123",
          "instance_type": "t3.small",
          "private_ip": "10.10.10.5"
        }
      }]
    }
  ]
}

2. Making Changes
─────────────────
Developer: vim main.tf
# Change: instance_type = "t3.medium"

terraform plan:
├─→ Downloads state from S3
├─→ Reads current .tf files
├─→ Queries AWS API for actual state
│
└─→ Compares three states:
    ├─→ State file: t3.small (last known)
    ├─→ Config file: t3.medium (desired)
    └─→ AWS actual: t3.small (current)

    Plan:
    ~ aws_instance.control_plane
      ~ instance_type: "t3.small" → "t3.medium"

terraform apply:
├─→ Acquires lock (DynamoDB)
│   └─→ Prevents concurrent applies
│
├─→ Makes changes in AWS
│   └─→ Modifies instance to t3.medium
│
├─→ Updates state file
│   └─→ Writes new state to S3
│
└─→ Releases lock

3. Team Collaboration
──────────────────────
Developer A (local):
├─→ terraform plan
├─→ Downloads latest state from S3
└─→ Sees: Instance is t3.medium

Developer B (GitHub Actions):
├─→ terraform apply
├─→ Acquires DynamoDB lock
├─→ Developer A's apply would wait or fail
└─→ Releases lock after complete

State locking prevents:
❌ Two people applying simultaneously
❌ Corrupted state files
❌ Lost changes
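
If an interrupted run leaves a stale lock behind, Terraform prints the lock ID, and the lock can be released manually; use this only when you're sure no apply is still running:

terraform force-unlock <LOCK_ID>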

Best Practices

1. Infrastructure as Code

DO:

# ✓ Use modules for reusability
module "vpc" {
  source = "../../modules/vpc"
  cidr_block = var.cidr_block
}

# ✓ Use variables for different environments
variable "instance_type" {
  default = "t3.small"
}

# ✓ Tag everything
tags = {
  Environment = var.environment
  ManagedBy   = "terraform"
  Project     = var.project_name
}

# ✓ Use remote state
backend "s3" {
  bucket = "my-terraform-state"
  key    = "path/to/state"
}

DON'T:

# ✗ Hardcode values
resource "aws_instance" "server" {
  instance_type = "t3.small"  # What about staging/prod?
}

# ✗ Store state locally
# Missing: backend configuration
# State only on your laptop = bad

# ✗ Use default VPC
resource "aws_instance" "server" {
  # No vpc_id specified = uses default
  # Default VPC is not production-ready
}

2. Security

DO:

# ✓ Least privilege IAM
data "aws_iam_policy_document" "app" {
  statement {
    actions = [
      "s3:GetObject",
      "s3:PutObject"
    ]
    resources = [
      "arn:aws:s3:::my-bucket/app-data/*"
    ]
  }
}

# ✓ Encrypt everything
resource "aws_ebs_volume" "data" {
  encrypted = true
}

# ✓ Private subnets for workloads
resource "aws_subnet" "private" {
  map_public_ip_on_launch = false
}

# ✓ Use OIDC for GitHub Actions
# No static AWS keys in GitHub!

DON'T:

# ✗ Overly permissive IAM
policy_arn = "arn:aws:iam::aws:policy/AdministratorAccess"

# ✗ Unencrypted volumes
resource "aws_ebs_volume" "data" {
  encrypted = false  # Bad!
}

# ✗ Everything in public subnets
resource "aws_subnet" "public" {
  map_public_ip_on_launch = true  # For NAT/ALB only!
}

# ✗ Static credentials in GitHub
AWS_ACCESS_KEY_ID: "AKIAIOSFODNN7EXAMPLE"  # Never!

3. Kubernetes

DO:

# ✓ Set resource limits
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 256Mi

# ✓ Use health checks
livenessProbe:
  httpGet:
    path: /health
    port: 8080
readinessProbe:
  httpGet:
    path: /ready
    port: 8080

# ✓ Don't run as root
securityContext:
  runAsNonRoot: true
  runAsUser: 1000

# ✓ Use namespaces
metadata:
  namespace: production

DON'T:

# ✗ No resource limits
# Missing resources: {}
# Pod can consume all node resources

# ✗ No health checks
# Missing probes
# K8s doesn't know if app is healthy

# ✗ Run as root
securityContext:
  runAsUser: 0  # root = dangerous

# ✗ Everything in default namespace
# Missing namespace
# No isolation between apps

4. GitOps

DO:

# ✓ Environment-specific configurations
overlays/
  dev/
    kustomization.yaml
  staging/
    kustomization.yaml
  prod/
    kustomization.yaml

# ✓ Automated sync for dev/staging
syncPolicy:
  automated:
    prune: true
    selfHeal: true

# ✓ Manual sync for production
syncPolicy:
  # No automated sync
  # Requires manual approval

# ✓ Use health checks
health:
  status: Healthy
  message: "Deployment has minimum availability"

DON'T:

# ✗ Same config for all environments
# One size doesn't fit all

# ✗ Auto-sync production without review
# Scary! Auto-deploy to prod without approval

# ✗ Ignore sync failures
# ArgoCD reports failure, but you don't notice

# ✗ Manual kubectl in production
# Bypasses GitOps, creates drift

5. Monitoring and Observability

DO:

# ✓ Structured logging
logger.info("User logged in", {
  "user_id": user.id,
  "ip": request.ip,
  "timestamp": datetime.now()
})

# ✓ Metrics
prometheus.histogram("request_duration_seconds")
prometheus.counter("requests_total")

# ✓ Distributed tracing
@trace(service="api", operation="get_user")
def get_user(user_id):
  ...

# ✓ Alerts
- alert: HighErrorRate
  expr: rate(errors_total[5m]) > 0.05
  for: 10m

DON'T:

# ✗ Print statements
print("User logged in")  # Lost in logs

# ✗ No metrics
# Can't see trends or anomalies

# ✗ No tracing
# Can't debug distributed systems

# ✗ No alerts
# Don't know when things break

6. Documentation

DO:

# ✓ Document decisions
## Architecture Decision Record (ADR)
- Date: 2024-01-15
- Decision: Use kubeadm instead of EKS
- Rationale: Cost savings, learning opportunity
- Consequences: More operational overhead

# ✓ Runbooks
## Deploying to Production
1. Create PR with changes
2. Wait for CI/CD to pass
3. Request review from team
4. Merge after approval
5. Monitor ArgoCD sync
6. Verify health checks

# ✓ Diagrams

[Architectural diagrams in docs/]


# ✓ Code comments for "why"
# We use server-side apply because CRDs exceed annotation limit
kubectl apply --server-side=true

DON'T:

# ✗ No documentation
# Just read the code!

# ✗ Outdated documentation
# Last updated: 2 years ago
# Doesn't match current setup

# ✗ Only "what" without "why"
# This creates a server
# (But why? What's it for?)

Next Steps

1. Complete the Current Setup

Fix the worker node issue:

# 1. Apply IAM changes
cd terraform/bootstrap/iam
terraform apply

# 2. Create SSM parameter
ssh to-control-plane
sudo kubeadm token create --print-join-command | \
  aws ssm put-parameter --name "/gitops-infra-dev/kubeadm-join-command" ...

# 3. Reboot workers
aws ec2 reboot-instances --instance-ids i-xxx i-yyy

# 4. Verify
kubectl get nodes
# Should see 3 nodes: 1 control-plane + 2 workers

2. Learn by Doing

Exercise 1: Deploy Your First App

# Create a simple nginx deployment
kubectl create deployment nginx --image=nginx:1.25
kubectl expose deployment nginx --port=80 --type=NodePort
kubectl get svc nginx
# Access at: http://<node-ip>:<node-port>

Exercise 2: Use ArgoCD

# Create an ArgoCD application
cat <<EOF | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/yourusername/your-repo
    path: kubernetes/my-app
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF

Exercise 3: Make Infrastructure Changes

# 1. Create a branch
git checkout -b add-monitoring-namespace

# 2. Add a namespace
cat <<EOF >> kubernetes/base/namespaces/namespaces.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    name: monitoring
EOF

# 3. Commit and push
git add .
git commit -m "Add monitoring namespace"
git push origin add-monitoring-namespace

# 4. Create PR
# Watch GitHub Actions run
# Merge after approval
# Watch ArgoCD sync

3. Expand Your Knowledge

Topics to Study:

  1. Kubernetes Deep Dive
  2. Terraform Mastery
  3. GitOps
  4. Observability
    • Prometheus + Grafana for metrics
    • ELK/EFK stack for logging
    • Jaeger for distributed tracing
    • Practice: Add monitoring to your cluster
  5. Security

4. Build Real Projects

Project Ideas:

  1. Personal Blog Platform
    • Deploy WordPress on Kubernetes
    • Use MySQL database
    • SSL with cert-manager
    • GitOps with ArgoCD
  2. Microservices Application
    • Multiple services (frontend, backend, database)
    • Service mesh (Istio/Linkerd)
    • Distributed tracing
    • CI/CD pipeline
  3. Monitoring Stack
    • Prometheus for metrics
    • Grafana for dashboards
    • AlertManager for alerts
    • Loki for logs
  4. Multi-Region Deployment
    • Deploy to multiple AWS regions
    • Use Route53 for DNS failover
    • Database replication
    • Disaster recovery plan

5. Get Certified

Recommended Certifications:

  1. AWS Certified Solutions Architect
    • Validates AWS knowledge
    • Industry recognized
    • Good for career growth
  2. Certified Kubernetes Administrator (CKA)
    • Hands-on exam
    • Proves K8s operational skills
    • Highly valued
  3. Terraform Associate
    • Official HashiCorp certification
    • Validates IaC skills
    • Growing demand

6. Join the Community

Resources:

  • Reddit: r/kubernetes, r/devops, r/terraform
  • Discord: Kubernetes, CNCF, DevOps
  • Twitter: Follow #kubernetes, #devops, #terraform
  • Meetups: Local DevOps/Kubernetes meetups
  • Conferences: KubeCon, HashiConf, DevOpsDays

7. Keep a Learning Log

Document your journey:

# DevOps Learning Log

## 2024-01-15: Fixed Worker Node Issue
- Problem: Nodes stuck in pending
- Root cause: Missing IAM permissions
- Lesson: Always verify IAM before deploying
- Next: Study IAM policies in depth

## 2024-01-16: Deployed First App with ArgoCD
- Created simple nginx deployment
- Set up ArgoCD sync
- Watched automatic deployment
- Next: Deploy multi-tier app

## 2024-01-17: Implemented Monitoring
- Installed Prometheus
- Created Grafana dashboards
- Set up alerts
- Next: Add distributed tracing

Conclusion

What You've Learned

In this journey, you've learned:

  1. DevOps Fundamentals
    • Automation, CI/CD, IaC
    • Collaboration between dev and ops
    • Version control for everything
  2. Infrastructure as Code
    • Terraform basics and advanced concepts
    • Modules, state, and providers
    • Multi-environment management
  3. Kubernetes
    • Architecture and components
    • Pods, Deployments, Services
    • Self-managed vs managed clusters
  4. GitOps
    • ArgoCD and continuous deployment
    • Declarative configuration
    • Self-healing and automated sync
  5. Real-World Troubleshooting
    • IAM permissions debugging
    • Node scheduling issues
    • CRD size limitations
    • Bootstrap order dependencies

The DevOps Mindset

Remember these principles:

  1. Automate Everything
    • If you do it twice, automate it
    • Manual processes are error-prone
    • Automation enables scale
  2. Version Control Everything
    • Code, infrastructure, configs
    • Git is the source of truth
    • Audit trail for all changes
  3. Fail Fast, Learn Fast
    • Test in dev/staging first
    • Failures are learning opportunities
    • Iterate quickly
  4. Measure Everything
    • Metrics, logs, traces
    • You can't improve what you don't measure
    • Data-driven decisions
  5. Never Stop Learning
    • Technology evolves rapidly
    • Always be curious
    • Share knowledge with others

Your Infrastructure

Your current setup is production-ready with:

  • ✓ Multi-environment Terraform (dev/staging/prod)
  • ✓ Self-managed Kubernetes cluster
  • ✓ GitOps with ArgoCD
  • ✓ CI/CD with GitHub Actions
  • ✓ Security best practices (IAM, OIDC, encryption)
  • ✓ Comprehensive documentation

Next steps:

  1. Fix the worker node issue (follow the guide above)
  2. Deploy your first application
  3. Expand to staging and production
  4. Add monitoring and logging
  5. Keep learning and improving

Final Thoughts

DevOps is a journey, not a destination. You'll encounter problems, make mistakes, and learn continuously. That's normal and healthy.

What matters:

  • Keep building
  • Keep learning
  • Keep improving
  • Keep sharing

Welcome to the world of DevOps! 🚀


Additional Resources

Books

  • "The Phoenix Project" by Gene Kim - DevOps novel
  • "Kubernetes Up & Running" by Kelsey Hightower
  • "Terraform: Up & Running" by Yevgeniy Brikman
  • "Site Reliability Engineering" by Google

Online Courses

  • A Cloud Guru - AWS, Kubernetes, Terraform
  • Linux Academy - DevOps learning paths
  • Udemy - Specific technology courses
  • Coursera - University-level courses

Hands-On Labs

  • Katacoda (now O'Reilly) - Interactive scenarios
  • Play with Kubernetes - Free K8s playground
  • AWS Free Tier - Practice on real cloud
  • KillerCoda - K8s and cloud labs

Documentation

  • Kubernetes: https://kubernetes.io/docs/
  • Terraform: https://developer.hashicorp.com/terraform/docs
  • ArgoCD: https://argo-cd.readthedocs.io/
  • AWS: https://docs.aws.amazon.com/

Tools to Explore

  • kubectl - K8s CLI
  • helm - K8s package manager
  • k9s - Terminal UI for K8s
  • lens - K8s IDE
  • terraform - IaC tool
  • terragrunt - Terraform wrapper
  • docker - Containerization
  • git - Version control

Remember: Every expert was once a beginner. Keep learning, keep building, and don't be afraid to break things (in dev environment)!

Happy learning! 🎓