DevOps Learning Journey: From Concepts to Production
A beginner's guide to understanding DevOps, Infrastructure as Code, and lessons learned from building a production-ready Kubernetes infrastructure.
What is DevOps?
The Traditional Problem
In traditional software development:
- Developers write code on their laptops
- Operations deploy and maintain servers
- These teams work in silos (separately)
- Deployments are manual, slow, and error-prone
- "It works on my machine" becomes a common problem
The DevOps Solution
DevOps combines Development + Operations into a unified workflow:
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Developer │────▶│ Automation │────▶│ Production │
│ writes │ │ Pipeline │ │ Server │
│ code │ │ │ │ │
└─────────────┘ └──────────────┘ └─────────────┘
│
│ Tests, Builds,
│ Deploys automatically
│
Key Principles:
- Automation: Automate everything (testing, building, deploying)
- Version Control: Store everything in Git (code, infrastructure, configs)
- Continuous Integration/Deployment (CI/CD): Deploy frequently and safely
- Monitoring: Know what's happening in production
- Collaboration: Developers and operations work together
Why DevOps Matters
| Without DevOps | With DevOps |
|---|---|
| Manual deployments taking hours/days | Automated deployments in minutes |
| Configuration drift (servers differ) | Consistent, reproducible infrastructure |
| "Works on my machine" problems | Identical dev/staging/prod environments |
| Fear of deploying (might break things) | Confidence through automation and testing |
| Slow feedback loops | Fast feedback and rapid iteration |
Core DevOps Concepts
1. Infrastructure as Code (IaC)
Traditional Approach:
1. Log into AWS console
2. Click "Create EC2 instance"
3. Choose settings manually
4. Repeat for each server
5. Hope you remember what you did
IaC Approach:
# infrastructure.tf
resource "aws_instance" "web_server" {
ami = "ami-12345"
instance_type = "t3.small"
tags = {
Name = "web-server-1"
}
}
Run terraform apply and it creates the server. Same code = same infrastructure every time.
2. Immutable Infrastructure
Old Way (Mutable):
- Create a server
- SSH in and install software
- Update it over time
- Each server becomes unique (snowflake)
New Way (Immutable):
- Define server configuration in code
- Build server image
- Deploy new servers, delete old ones
- Every server is identical and replaceable
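A minimal sketch of the immutable workflow, assuming a Packer template named web-server.pkr.hcl and a Terraform variable called ami_id (both names are illustrative):
# Bake a new machine image instead of patching running servers
packer build web-server.pkr.hcl
# Roll it out by pointing Terraform at the new image; old instances are replaced, not modified
terraform apply -var "ami_id=ami-0newbaked123"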
3. Configuration Management
Tools like:
- Terraform: Creates infrastructure (servers, networks, databases)
- Ansible/Chef/Puppet: Configures servers after creation
- Docker: Packages applications with dependencies
- Kubernetes: Orchestrates containers at scale
4. CI/CD Pipeline
Developer commits code to Git
↓
Automated tests run
↓
Build Docker image
↓
Deploy to staging
↓
Integration tests
↓
Deploy to production
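In shell terms, the stages above might boil down to commands like these (the test runner, image name, registry, and container name are placeholders):
# Automated tests
pytest
# Build and publish the image
docker build -t registry.example.com/myapp:1.2.0 .
docker push registry.example.com/myapp:1.2.0
# Roll the new image out to staging
kubectl set image deployment/myapp app=registry.example.com/myapp:1.2.0 -n staging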
Infrastructure as Code (IaC)
What is IaC?
Infrastructure as Code means writing code to manage and provision infrastructure instead of manual processes.
Benefits of IaC
Collaboration: Team can review infrastructure changes
# Pull request workflow
git checkout -b add-monitoring
# Make changes
git push
# Create PR for team review
Testing: Test infrastructure before deploying
terraform plan # Preview changes before applying
Documentation: Code IS the documentation
# This clearly shows we have 3 control plane nodes
count = 3
Reproducibility: Create identical environments
terraform apply # Creates exact same infrastructure
Version Control: Track every infrastructure change in Git
git log infrastructure.tf
# See who changed what and when
IaC Tools Comparison
| Tool | Purpose | Declarative/Imperative | Best For |
|---|---|---|---|
| Terraform | Infrastructure provisioning | Declarative | Multi-cloud infrastructure |
| CloudFormation | AWS infrastructure | Declarative | AWS-only projects |
| Ansible | Configuration management | Both | Server configuration |
| Pulumi | Infrastructure provisioning | Declarative (written in general-purpose languages) | Developers who prefer real code |
Declarative vs Imperative:
Imperative (How): Tell the computer HOW to do something
# Step by step instructions
create_vpc()
create_subnet()
create_instance()
Declarative (What): Tell the computer WHAT you want
# Describe desired state
resource "aws_vpc" "main" { ... }
resource "aws_subnet" "private" { ... }
resource "aws_instance" "app" { ... }
Terraform figures out HOW to create it.
Introduction to Terraform
What is Terraform?
Terraform is an open-source tool for building, changing, and versioning infrastructure safely and efficiently.
Created by HashiCorp, it works with:
- AWS, Azure, Google Cloud
- Kubernetes, Docker
- GitHub, Datadog
- 3,000+ providers
Core Terraform Concepts
1. Providers
Providers are plugins that interact with cloud platforms:
provider "aws" {
region = "us-east-1"
}
provider "kubernetes" {
host = "https://k8s-cluster.example.com"
}
2. Resources
Resources are infrastructure components:
# A resource has:
# - Type: aws_instance
# - Local name: web_server
# - Configuration: everything inside { }
resource "aws_instance" "web_server" {
ami = "ami-12345"
instance_type = "t3.small"
tags = {
Name = "My Web Server"
}
}
3. Variables
Variables make code reusable:
# variables.tf
variable "environment" {
description = "Environment name"
type = string
default = "dev"
}
# main.tf
resource "aws_instance" "server" {
instance_type = var.environment == "prod" ? "t3.large" : "t3.small"
tags = {
Environment = var.environment
}
}
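To pick a different environment at apply time, the variable can be overridden on the command line or via a .tfvars file (the file name is an example):
terraform apply -var="environment=prod"
terraform apply -var-file="prod.tfvars"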
4. Outputs
Outputs extract information:
output "server_ip" {
value = aws_instance.web_server.public_ip
}
# After terraform apply:
# server_ip = "54.123.45.67"
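Outputs can also be read back later without re-applying:
terraform output server_ip        # single value
terraform output -json            # all outputs, machine-readable for scripts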
5. State
Terraform stores the current state of infrastructure:
# backend.tf
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "infrastructure/terraform.tfstate"
region = "us-east-1"
}
}
Why state matters:
- Knows what infrastructure exists
- Maps resources to real-world objects
- Tracks metadata and dependencies
- Enables collaboration (shared state)
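You can inspect the state directly with the built-in state commands (the resource address matches the earlier example):
terraform state list                           # every resource Terraform is tracking
terraform state show aws_instance.web_server   # attributes of one resource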
Terraform Workflow
# 1. Write configuration
vim main.tf
# 2. Initialize (download providers)
terraform init
# 3. Preview changes
terraform plan
# 4. Apply changes
terraform apply
# 5. View state
terraform show
# 6. Destroy (cleanup)
terraform destroy
Example: Creating an AWS Server
# main.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
provider "aws" {
region = "us-east-1"
}
# Create VPC
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
tags = {
Name = "main-vpc"
}
}
# Create subnet
resource "aws_subnet" "public" {
vpc_id = aws_vpc.main.id # Reference to VPC above
cidr_block = "10.0.1.0/24"
tags = {
Name = "public-subnet"
}
}
# Create server
resource "aws_instance" "web" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = "t3.micro"
subnet_id = aws_subnet.public.id # Reference to subnet
tags = {
Name = "web-server"
}
}
# Output the server's IP
output "server_ip" {
value = aws_instance.web.public_ip
}
Run it:
terraform init
terraform plan
terraform apply
# server_ip = "54.123.45.67"
Terraform Modules
Modules are reusable Terraform code:
modules/
└── vpc/
├── main.tf # VPC resources
├── variables.tf # Inputs
└── outputs.tf # Outputs
environments/
├── dev/
│ └── main.tf # Uses vpc module
├── staging/
│ └── main.tf # Uses vpc module
└── prod/
└── main.tf # Uses vpc module
Using a module:
module "vpc" {
source = "../../modules/vpc"
cidr_block = "10.0.0.0/16"
environment = "dev"
}
# Access module outputs
output "vpc_id" {
value = module.vpc.vpc_id
}
Benefits:
- DRY (Don't Repeat Yourself)
- Tested, reusable components
- Easier to maintain
- Consistent across environments
Understanding Kubernetes
What is Kubernetes?
Kubernetes (K8s) is a container orchestration platform that automates deployment, scaling, and management of containerized applications.
Think of it as an operating system for your cluster:
- Without K8s: You manually start containers on servers
- With K8s: You declare what you want, K8s makes it happen
Why Kubernetes?
Problem without K8s:
Server 1: Running app-v1, app-v1, app-v2 (manual deployment)
Server 2: Running app-v1, crashed (needs manual restart)
Server 3: Idle (wasting money)
Solution with K8s:
3 Servers (cluster)
K8s automatically:
- Distributes containers evenly
- Restarts crashed containers
- Scales up/down based on load
- Updates apps with zero downtime
Kubernetes Architecture
┌─────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌───────────────────────────────────────────────┐ │
│ │ Control Plane (Master) │ │
│ │ ┌──────────┐ ┌───────┐ ┌──────────────┐ │ │
│ │ │ API │ │ etcd │ │ Scheduler │ │ │
│ │ │ Server │ │ │ │ Controller │ │ │
│ │ └──────────┘ └───────┘ └──────────────┘ │ │
│ └───────────────────────────────────────────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │
│ │ │ │ │ │ │ │
│ │ Pods │ │ Pods │ │ Pods │ │
│ │ [App] │ │ [App] │ │ [App] │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────┘
Control Plane Components:
- API Server: Front-end for K8s (you talk to this)
- etcd: Database storing cluster state
- Scheduler: Decides which worker runs which pod
- Controller Manager: Maintains desired state
Worker Node Components:
- kubelet: Agent running on each node
- Container Runtime: Runs containers (Docker, containerd)
- kube-proxy: Network proxy for services
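On a running cluster you can see these pieces for yourself; on kubeadm clusters the control plane components run as pods in kube-system:
kubectl get nodes -o wide          # control plane and worker nodes
kubectl get pods -n kube-system    # API server, etcd, scheduler, kube-proxy, etc.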
Key Kubernetes Concepts
1. Pods
Pod = Smallest deployable unit (1+ containers)
apiVersion: v1
kind: Pod
metadata:
name: nginx-pod
spec:
containers:
- name: nginx
image: nginx:1.25
ports:
- containerPort: 80
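Assuming the manifest above is saved as nginx-pod.yaml, you could create and inspect the pod like this:
kubectl apply -f nginx-pod.yaml
kubectl get pod nginx-pod -o wide
kubectl logs nginx-pod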
2. Deployments
Deployment = Manages multiple pods (replicas)
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
replicas: 3 # Run 3 copies
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.25
ports:
- containerPort: 80
What it does:
- Maintains 3 nginx pods
- If one crashes, creates a new one
- Rolling updates (update without downtime)
- Rollback if update fails
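A few commands that exercise those behaviours (the deployment name matches the manifest above):
kubectl rollout status deployment/nginx-deployment   # watch a rolling update
kubectl rollout undo deployment/nginx-deployment     # roll back a bad release
kubectl scale deployment/nginx-deployment --replicas=5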
3. Services
Service = Stable network endpoint for pods
apiVersion: v1
kind: Service
metadata:
name: nginx-service
spec:
selector:
app: nginx
ports:
- port: 80
targetPort: 80
type: LoadBalancer # Creates AWS load balancer
Service Types:
- ClusterIP: Internal only (default)
- NodePort: Accessible on node IP:port
- LoadBalancer: Creates cloud load balancer
- ExternalName: DNS alias
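To check what a Service resolved to, or test it without a cloud load balancer:
kubectl get svc nginx-service
kubectl port-forward svc/nginx-service 8080:80   # temporary local access for testing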
4. Namespaces
Namespace = Virtual cluster (logical separation)
# System namespaces
kube-system # K8s components
kube-public # Public resources
default # Default namespace
# Custom namespaces
dev # Development apps
staging # Staging apps
prod # Production apps
apiVersion: v1
kind: Namespace
metadata:
name: production
# Deploy to namespace
kubectl apply -f app.yaml -n production
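Typical namespace housekeeping commands:
kubectl create namespace production
kubectl get all -n production
kubectl config set-context --current --namespace=production   # default namespace for this context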
5. ConfigMaps and Secrets
ConfigMap = Configuration data
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
database_url: "postgres://db.example.com:5432"
log_level: "info"
Secret = Sensitive data (base64 encoded)
apiVersion: v1
kind: Secret
metadata:
name: db-secret
type: Opaque
data:
password: cGFzc3dvcmQxMjM= # base64 encoded
Using in pod:
spec:
containers:
- name: app
image: myapp:1.0
env:
- name: DATABASE_URL
valueFrom:
configMapKeyRef:
name: app-config
key: database_url
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: db-secret
key: password
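The same objects can be created imperatively, which also shows where the base64 value comes from (the literal values are just examples):
kubectl create configmap app-config --from-literal=log_level=info
kubectl create secret generic db-secret --from-literal=password='password123'
echo -n 'password123' | base64    # cGFzc3dvcmQxMjM=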
Self-Managed vs Managed Kubernetes
| Aspect | Self-Managed (kubeadm) | Managed (EKS/GKE/AKS) |
|---|---|---|
| Setup | Manual (complex) | Automated (easy) |
| Control Plane | You manage | Cloud manages |
| Upgrades | Manual | Automated options |
| Cost | EC2 costs only | EC2 + control plane fee |
| Customization | Full control | Some limitations |
| Responsibility | Everything | Just worker nodes |
| Best For | Learning, custom needs | Production, simplicity |
Our Infrastructure uses self-managed (kubeadm) for:
- Full control and learning
- No control plane costs
- Understanding how K8s works internally
GitOps: The Modern Way
What is GitOps?
GitOps = Using Git as the single source of truth for infrastructure and applications.
Traditional Deployment:
Developer → kubectl apply → Kubernetes
GitOps Deployment:
Developer → Git commit → ArgoCD → Kubernetes
GitOps Principles
- Declarative: Desired state declared in Git
- Versioned: All changes tracked in Git
- Pulled: Agents pull changes (don't push)
- Continuously Reconciled: Auto-sync to match Git
GitOps with ArgoCD
ArgoCD is a GitOps tool for Kubernetes:
┌──────────┐ ┌──────────┐ ┌─────────────┐
│ Git │◄─────│ ArgoCD │─────▶│ Kubernetes │
│ Repo │ Pull │ Syncs │Apply │ Cluster │
└──────────┘ └──────────┘ └─────────────┘
│ │
│ Developer commits │
│ deployment.yaml │
│ │
└──────────────────────────────────────┘
ArgoCD detects change
and applies to cluster
Workflow:
Developer makes change:
# Update image version
vim kubernetes/app-deployment.yaml
git commit -m "Update app to v1.2.0"
git push
ArgoCD detects change:
ArgoCD polls Git every 3 minutes
Sees new commit
Compares Git state vs Cluster state
ArgoCD syncs:
Applies changes to Kubernetes
Monitors rollout
Reports status back
Self-healing:
If someone manually changes K8s
ArgoCD detects drift
Reverts to Git state
ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: my-app
namespace: argocd
spec:
# Where is the source?
source:
repoURL: https://github.com/myorg/my-repo
path: kubernetes/overlays/production
targetRevision: main
# Where to deploy?
destination:
server: https://kubernetes.default.svc
namespace: production
# How to sync?
syncPolicy:
automated:
prune: true # Delete resources not in Git
selfHeal: true # Revert manual changes
syncOptions:
- CreateNamespace=true
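Assuming the manifest above is saved as application.yaml, registering and checking it looks like this (the argocd CLI commands assume the CLI is installed and logged in):
kubectl apply -f application.yaml
argocd app get my-app     # sync and health status
argocd app sync my-app    # trigger a sync without waiting for the poll interval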
Benefits of GitOps
Security: No kubectl access needed
Developers → Git (with PR review)
ArgoCD → Kubernetes (with K8s credentials)
Developers never need cluster access
Developer Friendly: Use familiar Git workflow
# Same as code development
git checkout -b feature
# Make changes
git commit && git push
# Create pull request
# After approval, auto-deploys
Disaster Recovery: Recreate everything from Git
# Lost cluster? Just point ArgoCD to Git
# Everything recreates automatically
Easy Rollback: Revert Git commit
git revert HEAD
git push
# ArgoCD automatically rolls back
Audit Trail: Every change tracked in Git
git log kubernetes/
# See who deployed what and when
Real-World Lessons Learned
Lesson 1: The Missing IAM Permissions
What Happened
Worker nodes never joined the cluster, so all ArgoCD pods sat in Pending for 4+ hours with nowhere to schedule.
kubectl get pods -n argocd
# All pods: STATUS=Pending for 4+ hours
Root Cause
The infrastructure uses a clever design:
- Control plane node creates a kubeadm join command
- Stores it in AWS Systems Manager (SSM) Parameter Store
- Worker nodes retrieve the command from SSM
- Use it to join the cluster
The problem: the IAM role had only the AmazonSSMManagedInstanceCore policy, which covers core SSM functionality such as Session Manager (shell access without SSH) but does not allow writing parameters to SSM Parameter Store (ssm:PutParameter).
Control Plane tried to:
aws ssm put-parameter --name "/cluster/join-command" ...
❌ AccessDenied: Not authorized to perform ssm:PutParameter
Worker Nodes waiting for parameter that was never created...
⏰ Waiting... 4 hours... still waiting...
The Fix
Added SSM Parameter Store permissions to IAM role:
statement {
sid = "SSMParameterStore"
actions = [
"ssm:GetParameter",
"ssm:PutParameter",
"ssm:DeleteParameter"
]
resources = [
"arn:aws:ssm:*:*:parameter/gitops-infra-*"
]
}
Lessons Learned
- IAM Policies are Specific: AmazonSSMManagedInstanceCore ≠ Parameter Store access
- Read policy documentation carefully
- Test what the policy actually allows
- Understand Dependencies: Worker nodes depend on control plane's SSM parameter
- Document dependencies in code comments
- Test the full workflow
Validate IAM Before Launch:
# Test IAM permissions before deploying
aws ssm put-parameter --name "/test" --value "test" --type "String"
Check Logs Early:
# Don't wait 4 hours!
# Check logs immediately when pods are pending
kubectl describe pod <pod-name>
ssh to-node
sudo journalctl -u cloud-final | grep error
Lesson 2: Control Plane Taints
What Happened
Only one node in cluster (control-plane), but pods can't schedule on it.
kubectl get nodes
# NAME STATUS ROLES AGE
# control-plane-node Ready control-plane 5h
kubectl get pods -n argocd
# All pods: Pending
kubectl describe pod argocd-server-xxx
# Events: 0/1 nodes are available: 1 node(s) had untolerated taint
Root Cause
Kubernetes taints the control-plane node by default:
node-role.kubernetes.io/control-plane: NoSchedule
Why? Security and stability:
- Control plane runs critical K8s components
- User workloads might consume resources
- Separate control plane from workloads
The problem: No worker nodes exist, only control plane.
The Fix
Option 1: Remove taint (dev/testing only)
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
Option 2: Add worker nodes (proper solution)
# Fix IAM + reboot workers
aws ec2 reboot-instances --instance-ids i-xxx i-yyy
Lessons Learned
- Cluster Size Matters:
- 1 node: Remove taint (development only)
- 3+ nodes: Keep taint (production practice)
Check Node Availability First:
kubectl get nodes
kubectl describe nodes
# Look for taints and allocatable resources
Understand Taints and Tolerations:
# Taint = "I don't want pods"
node-role.kubernetes.io/control-plane: NoSchedule
# Toleration = "I can handle that taint"
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
Lesson 3: Server-Side Apply for Large CRDs
What Happened
Installing ArgoCD failed:
kubectl apply -f install.yaml
# Error: The CustomResourceDefinition "applicationsets.argoproj.io" is invalid:
# metadata.annotations: Too long: must have at most 262144 bytes
Root Cause
Client-side apply (kubectl apply):
- Stores the entire manifest in the kubectl.kubernetes.io/last-applied-configuration annotation
- Used for three-way merge on updates
- ArgoCD CRDs are huge (>262KB)
- Exceeds Kubernetes annotation limit
Server-side apply (kubectl apply --server-side):
- Server tracks changes (no annotation needed)
- No size limit
- Modern approach (beta since Kubernetes 1.16, GA in 1.22)
The Fix
# Old way (fails)
kubectl apply -f argocd-install.yaml
# New way (works)
kubectl apply --server-side=true --force-conflicts -f argocd-install.yaml
Lessons Learned
- Use Server-Side Apply for Large Resources:
- CRDs, especially from complex tools (ArgoCD, Istio, etc.)
- Avoids annotation size limits
- Better conflict detection
Handle Conflicts Gracefully:
# If switching from client-side to server-side
--force-conflicts # Take ownership of fields
Understand Apply Mechanisms:
Client-side apply:
- Three-way merge (current, desired, last-applied)
- Last-applied stored in annotation
- Legacy approach
Server-side apply:
- Server tracks field ownership
- No annotation needed
- Modern, recommended
Lesson 4: EC2 Instances Running ≠ Nodes Ready
What Happened
aws ec2 describe-instances
# i-xxx: running ✓
# i-yyy: running ✓
kubectl get nodes
# Only control-plane node (no workers)
EC2 shows instances running, but Kubernetes doesn't see them.
Root Cause
EC2 instance "running" just means:
- VM is powered on
- Operating system booted
It does NOT mean:
- Software installed correctly
- Node joined Kubernetes cluster
- Ready to run pods
The problem: kubeadm join command never ran successfully (due to missing SSM parameter).
Lessons Learned
Validate from Multiple Perspectives:
# AWS perspective
aws ec2 describe-instances
# Kubernetes perspective
kubectl get nodes
# Application perspective
kubectl get pods
Check User-Data Logs:
ssh to-instance
sudo cat /var/log/cloud-init-output.log
sudo journalctl -u cloud-final
EC2 Status vs Application Status:
EC2 Instance: running ✓
├─ OS Booted: yes ✓
├─ User-data script: failed ✗
└─ Application: not working ✗
Lesson 5: Bootstrap Order Matters
What Happened
Worker nodes tried to join before control plane was ready, or control plane couldn't store join command due to missing permissions.
Root Cause
Dependency chain:
1. IAM role created
↓
2. EC2 instances launch
↓
3. Control plane initializes
↓
4. Control plane stores join command (needs IAM)
↓
5. Worker nodes retrieve join command (needs IAM)
↓
6. Worker nodes join cluster
If step 4 or 5 fails, everything breaks.
Lessons Learned
Dependencies in Code:
resource "aws_instance" "worker" {
# ...
depends_on = [
aws_ssm_parameter.join_command # Explicit dependency
]
}
Add Retry Logic:
# In worker.sh
for i in {1..30}; do
JOIN_COMMAND=$(aws ssm get-parameter ...) && break
echo "Waiting for join command... ($i/30)"
sleep 10
done
IAM Propagation Delay:
- IAM changes aren't instant
- Wait 10-30 seconds after creating a role before launching instances that use it
Bootstrap Order Matters:
# Correct order:
1. terraform apply (bootstrap/iam) # IAM roles
2. Wait for IAM propagation # 10-30 seconds
3. terraform apply (environments/dev) # Infrastructure
4. Verify control plane ready # Check logs
5. Verify workers joined # kubectl get nodes
How Everything Works Together
The Complete Infrastructure Flow
Let's trace how everything connects in our infrastructure:
┌─────────────────────────────────────────────────────────┐
│ 1. Developer │
│ │
│ git commit -m "Add new feature" │
│ git push origin main │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 2. GitHub Actions │
│ │
│ - Runs on: push to main │
│ - Authenticates: OIDC (no static keys!) │
│ - Validates: terraform plan │
│ - Applies: terraform apply (if approved) │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 3. Terraform │
│ │
│ - Reads: .tf files from Git │
│ - Plans: Calculate changes needed │
│ - Applies: Create/update AWS resources │
│ - State: Stored in S3 (gitops-infra-terraform-state) │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 4. AWS │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │ VPC │ │ EC2 Control │ │ EC2 Workers │ │
│ │ Subnets │ │ Plane │ │ (2 nodes) │ │
│ │ NAT/IGW │ │ (1 node) │ │ │ │
│ └─────────────┘ └──────┬───────┘ └───────┬────────┘ │
│ │ │ │
│ │ Runs kubeadm init │ │
│ │ │ │
│ ▼ │ │
│ ┌────────────────┐ │ │
│ │ SSM Parameter │◄─────────┘ │
│ │ /cluster/join │ Workers fetch │
│ └────────────────┘ join command │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 5. Kubernetes Cluster │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Control Plane (Master) │ │
│ │ - API Server (port 6443) │ │
│ │ - etcd (cluster state) │ │
│ │ - Scheduler (assigns pods to nodes) │ │
│ │ - Controller Manager (maintains state) │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ ┌────────────┐ ┌────────────┐ │
│ │ Worker 1 │ │ Worker 2 │ │
│ │ │ │ │ │
│ │ kubelet │ │ kubelet │ │
│ │ Pods: │ │ Pods: │ │
│ │ - ArgoCD │ │ - Apps │ │
│ └────────────┘ └────────────┘ │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 6. ArgoCD │
│ │
│ - Installed in: argocd namespace │
│ - Watches: GitHub repo (infrastructure) │
│ - Syncs: Every 3 minutes │
│ - Applies: Kubernetes manifests from Git │
│ │
│ Applications managed: │
│ ├── ingress-nginx (core) │
│ ├── cert-manager (core) │
│ ├── external-secrets (core) │
│ └── sample-app (application) │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 7. Application Deployment │
│ │
│ Developer commits: kubernetes/app-deployment.yaml │
│ ↓ │
│ ArgoCD detects change in Git │
│ ↓ │
│ ArgoCD applies to Kubernetes │
│ ↓ │
│ Deployment creates Pods │
│ ↓ │
│ Service exposes Pods │
│ ↓ │
│ Ingress routes traffic (HTTPS with cert-manager) │
│ ↓ │
│ Application running ✓ │
└─────────────────────────────────────────────────────────┘
Component Interaction Example
Let's trace a real deployment:
Scenario: Deploy new version of application
Day 1, 9:00 AM - Developer makes change:
─────────────────────────────────────────
vim helm/values/dev/values.yaml
# Change: image.tag: "1.1.0" → "1.2.0"
git commit -m "Update app to v1.2.0"
git push origin main
├─→ GitHub receives push
├─→ Triggers GitHub Actions workflow
├─→ Runs terraform plan (if Terraform changed)
└─→ No Terraform changes, skips apply
Day 1, 9:03 AM - ArgoCD detects change:
────────────────────────────────────────
ArgoCD polls GitHub every 3 minutes
├─→ Fetches latest commit
├─→ Compares Git state vs Cluster state
└─→ Detects difference: image tag changed
├─→ ArgoCD syncs application
├─→ kubectl apply -f deployment.yaml
├─→ Kubernetes creates new ReplicaSet
│ └─→ New ReplicaSet: image:1.2.0
│ Old ReplicaSet: image:1.1.0
│
├─→ Rolling update starts:
│ ├─→ Create 1 pod with new version
│ ├─→ Wait for pod ready
│ ├─→ Terminate 1 pod with old version
│ └─→ Repeat until all pods updated
│
└─→ ArgoCD monitors health
├─→ Checks pod status
├─→ Checks readiness probes
└─→ Reports: Healthy ✓
Day 1, 9:05 AM - Deployment complete:
──────────────────────────────────────
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# app-55c8d9f89d-abc12 1/1 Running 0 2m
# app-55c8d9f89d-def34 1/1 Running 0 2m
# app-55c8d9f89d-ghi56 1/1 Running 0 2m
kubectl get replicasets
# NAME DESIRED CURRENT READY AGE
# app-55c8d9f89d 3 3 3 2m (new)
# app-64b7c8d78e 0 0 0 5d (old)
ArgoCD UI shows:
├─→ Status: Synced + Healthy
├─→ Last Sync: 2 minutes ago
└─→ Git Commit: abc123 "Update app to v1.2.0"
The Security Flow
How authentication and authorization work:
┌─────────────────────────────────────────────────────────┐
│ Security Layers │
└─────────────────────────────────────────────────────────┘
1. GitHub Actions → AWS
─────────────────────────
GitHub Actions workflow runs
├─→ Uses: aws-actions/configure-aws-credentials
├─→ Method: OIDC (OpenID Connect)
│ ├─→ No static AWS keys in GitHub!
│ ├─→ GitHub provides JWT token
│ └─→ AWS validates token
│
└─→ Assumes IAM role: gitops-infra-github-actions-role
└─→ Role has: AdministratorAccess (Terraform needs it)
2. EC2 Instances → AWS Services
────────────────────────────────
EC2 instance launches
├─→ Attached: Instance Profile (gitops-infra-k8s-node-profile)
├─→ Profile contains: IAM Role (gitops-infra-k8s-node-role)
│
└─→ Role has permissions:
├─→ EC2: Describe instances, attach volumes
├─→ ECR: Pull container images
├─→ SSM: Session Manager + Parameter Store
└─→ ELB: Describe load balancers
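From a shell on one of the nodes you can confirm which role the instance actually assumed and whether Parameter Store access works (the parameter name follows the pattern used elsewhere in this guide):
aws sts get-caller-identity
aws ssm get-parameter --name "/gitops-infra-dev/kubeadm-join-command" --with-decryption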
3. kubectl → Kubernetes API
────────────────────────────
User runs kubectl command
├─→ Reads: ~/.kube/config
│ ├─→ Contains: API server URL
│ └─→ Contains: Client certificate
│
└─→ API Server validates:
├─→ Authentication: Is this a valid user?
│ └─→ Checks client certificate
│
└─→ Authorization: What can they do?
└─→ RBAC (Role-Based Access Control)
├─→ Roles: Define permissions
└─→ RoleBindings: Assign roles to users
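kubectl can answer authorization questions directly, which is handy when debugging RBAC:
kubectl auth can-i create deployments -n production
kubectl auth can-i delete pods -n production --as=jane    # check another user's permissions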
4. Pods → AWS Services
───────────────────────
Pod needs to access AWS (e.g., S3, Secrets Manager)
├─→ Option 1: IRSA (IAM Roles for Service Accounts)
│ ├─→ Service Account annotated with IAM role
│ ├─→ Pod uses this Service Account
│ └─→ Automatically gets temporary AWS credentials
│
└─→ Option 2: Node IAM role (less secure)
└─→ Pod inherits node's IAM role permissions
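With IRSA, the link between a pod and an IAM role is just an annotation on the Service Account. This works out of the box on EKS; a self-managed cluster needs the pod identity webhook installed first. The names and account ID below are placeholders:
kubectl annotate serviceaccount app-sa -n production \
  eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/app-role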
The State Management Flow
How Terraform tracks infrastructure:
┌─────────────────────────────────────────────────────────┐
│ Terraform State Flow │
└─────────────────────────────────────────────────────────┘
1. Initial Setup
────────────────
terraform {
backend "s3" {
bucket = "gitops-infra-terraform-state-ACCOUNT_ID"
key = "environments/dev/terraform.tfstate"
region = "eu-west-1"
}
}
State file structure:
{
"version": 4,
"terraform_version": "1.5.0",
"resources": [
{
"type": "aws_instance",
"name": "control_plane",
"instances": [{
"attributes": {
"id": "i-0abc123",
"instance_type": "t3.small",
"private_ip": "10.10.10.5"
}
}]
}
]
}
2. Making Changes
─────────────────
Developer: vim main.tf
# Change: instance_type = "t3.medium"
terraform plan:
├─→ Downloads state from S3
├─→ Reads current .tf files
├─→ Queries AWS API for actual state
│
└─→ Compares three states:
├─→ State file: t3.small (last known)
├─→ Config file: t3.medium (desired)
└─→ AWS actual: t3.small (current)
Plan:
~ aws_instance.control_plane
~ instance_type: "t3.small" → "t3.medium"
terraform apply:
├─→ Acquires lock (DynamoDB)
│ └─→ Prevents concurrent applies
│
├─→ Makes changes in AWS
│ └─→ Modifies instance to t3.medium
│
├─→ Updates state file
│ └─→ Writes new state to S3
│
└─→ Releases lock
3. Team Collaboration
──────────────────────
Developer A (local):
├─→ terraform plan
├─→ Downloads latest state from S3
└─→ Sees: Instance is t3.medium
Developer B (GitHub Actions):
├─→ terraform apply
├─→ Acquires DynamoDB lock
├─→ Developer A's apply would wait or fail
└─→ Releases lock after complete
State locking prevents:
❌ Two people applying simultaneously
❌ Corrupted state files
❌ Lost changes
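State locking needs a DynamoDB table with a LockID partition key, referenced from the S3 backend block via dynamodb_table (the table name here is an example):
aws dynamodb create-table \
  --table-name terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST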
Best Practices
1. Infrastructure as Code
DO:
# ✓ Use modules for reusability
module "vpc" {
source = "../../modules/vpc"
cidr_block = var.cidr_block
}
# ✓ Use variables for different environments
variable "instance_type" {
default = "t3.small"
}
# ✓ Tag everything
tags = {
Environment = var.environment
ManagedBy = "terraform"
Project = var.project_name
}
# ✓ Use remote state
backend "s3" {
bucket = "my-terraform-state"
key = "path/to/state"
}
DON'T:
# ✗ Hardcode values
resource "aws_instance" "server" {
instance_type = "t3.small" # What about staging/prod?
}
# ✗ Store state locally
# Missing: backend configuration
# State only on your laptop = bad
# ✗ Use default VPC
resource "aws_instance" "server" {
# No vpc_id specified = uses default
# Default VPC is not production-ready
}
2. Security
DO:
# ✓ Least privilege IAM
data "aws_iam_policy_document" "app" {
statement {
actions = [
"s3:GetObject",
"s3:PutObject"
]
resources = [
"arn:aws:s3:::my-bucket/app-data/*"
]
}
}
# ✓ Encrypt everything
resource "aws_ebs_volume" "data" {
encrypted = true
}
# ✓ Private subnets for workloads
resource "aws_subnet" "private" {
map_public_ip_on_launch = false
}
# ✓ Use OIDC for GitHub Actions
# No static AWS keys in GitHub!
DON'T:
# ✗ Overly permissive IAM
policy_arn = "arn:aws:iam::aws:policy/AdministratorAccess"
# ✗ Unencrypted volumes
resource "aws_ebs_volume" "data" {
encrypted = false # Bad!
}
# ✗ Everything in public subnets
resource "aws_subnet" "public" {
map_public_ip_on_launch = true # For NAT/ALB only!
}
# ✗ Static credentials in GitHub
AWS_ACCESS_KEY_ID: "AKIAIOSFODNN7EXAMPLE" # Never!
3. Kubernetes
DO:
# ✓ Set resource limits
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 250m
memory: 256Mi
# ✓ Use health checks
livenessProbe:
httpGet:
path: /health
port: 8080
readinessProbe:
httpGet:
path: /ready
port: 8080
# ✓ Don't run as root
securityContext:
runAsNonRoot: true
runAsUser: 1000
# ✓ Use namespaces
metadata:
namespace: production
DON'T:
# ✗ No resource limits
# Missing resources: {}
# Pod can consume all node resources
# ✗ No health checks
# Missing probes
# K8s doesn't know if app is healthy
# ✗ Run as root
securityContext:
runAsUser: 0 # root = dangerous
# ✗ Everything in default namespace
# Missing namespace
# No isolation between apps
4. GitOps
DO:
# ✓ Environment-specific configurations
overlays/
dev/
kustomization.yaml
staging/
kustomization.yaml
prod/
kustomization.yaml
# ✓ Automated sync for dev/staging
syncPolicy:
automated:
prune: true
selfHeal: true
# ✓ Manual sync for production
syncPolicy:
# No automated sync
# Requires manual approval
# ✓ Use health checks
health:
status: Healthy
message: "Deployment has minimum availability"
DON'T:
# ✗ Same config for all environments
# One size doesn't fit all
# ✗ Auto-sync production without review
# Scary! Auto-deploy to prod without approval
# ✗ Ignore sync failures
# ArgoCD reports failure, but you don't notice
# ✗ Manual kubectl in production
# Bypasses GitOps, creates drift
5. Monitoring and Observability
DO:
# ✓ Structured logging
logger.info("User logged in", {
"user_id": user.id,
"ip": request.ip,
"timestamp": datetime.now()
})
# ✓ Metrics
prometheus.histogram("request_duration_seconds")
prometheus.counter("requests_total")
# ✓ Distributed tracing
@trace(service="api", operation="get_user")
def get_user(user_id):
...
# ✓ Alerts
- alert: HighErrorRate
expr: rate(errors_total[5m]) > 0.05
for: 10m
DON'T:
# ✗ Print statements
print("User logged in") # Lost in logs
# ✗ No metrics
# Can't see trends or anomalies
# ✗ No tracing
# Can't debug distributed systems
# ✗ No alerts
# Don't know when things break
6. Documentation
DO:
# ✓ Document decisions
## Architecture Decision Record (ADR)
- Date: 2024-01-15
- Decision: Use kubeadm instead of EKS
- Rationale: Cost savings, learning opportunity
- Consequences: More operational overhead
# ✓ Runbooks
## Deploying to Production
1. Create PR with changes
2. Wait for CI/CD to pass
3. Request review from team
4. Merge after approval
5. Monitor ArgoCD sync
6. Verify health checks
# ✓ Diagrams
[Architectural diagrams in docs/]
# ✓ Code comments for "why"
# We use server-side apply because CRDs exceed annotation limit
kubectl apply --server-side=true
DON'T:
# ✗ No documentation
# Just read the code!
# ✗ Outdated documentation
# Last updated: 2 years ago
# Doesn't match current setup
# ✗ Only "what" without "why"
# This creates a server
# (But why? What's it for?)
Next Steps
1. Complete the Current Setup
Fix the worker node issue:
# 1. Apply IAM changes
cd terraform/bootstrap/iam
terraform apply
# 2. Create SSM parameter
ssh to-control-plane
sudo kubeadm token create --print-join-command | \
aws ssm put-parameter --name "/gitops-infra-dev/kubeadm-join-command" ...
# 3. Reboot workers
aws ec2 reboot-instances --instance-ids i-xxx i-yyy
# 4. Verify
kubectl get nodes
# Should see 3 nodes: 1 control-plane + 2 workers
2. Learn by Doing
Exercise 1: Deploy Your First App
# Create a simple nginx deployment
kubectl create deployment nginx --image=nginx:1.25
kubectl expose deployment nginx --port=80 --type=NodePort
kubectl get svc nginx
# Access at: http://<node-ip>:<node-port>
Exercise 2: Use ArgoCD
# Create an ArgoCD application
cat <<EOF | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: my-app
namespace: argocd
spec:
source:
repoURL: https://github.com/yourusername/your-repo
path: kubernetes/my-app
targetRevision: main
destination:
server: https://kubernetes.default.svc
namespace: default
syncPolicy:
automated:
prune: true
selfHeal: true
EOF
Exercise 3: Make Infrastructure Changes
# 1. Create a branch
git checkout -b add-monitoring-namespace
# 2. Add a namespace
cat <<EOF >> kubernetes/base/namespaces/namespaces.yaml
---
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
labels:
name: monitoring
EOF
# 3. Commit and push
git add .
git commit -m "Add monitoring namespace"
git push origin add-monitoring-namespace
# 4. Create PR
# Watch GitHub Actions run
# Merge after approval
# Watch ArgoCD sync
3. Expand Your Knowledge
Topics to Study:
- Kubernetes Deep Dive
- Kubernetes Official Docs
- Kubernetes the Hard Way
- Practice: CKA (Certified Kubernetes Administrator)
- Terraform Mastery
- Terraform Best Practices
- Terraform Registry
- Practice: Build multi-tier AWS infrastructure
- GitOps
- ArgoCD Documentation
- Flux CD (Alternative to ArgoCD)
- Practice: Set up a complete GitOps workflow
- Observability
- Prometheus + Grafana for metrics
- ELK/EFK stack for logging
- Jaeger for distributed tracing
- Practice: Add monitoring to your cluster
- Security
- CIS Kubernetes Benchmark
- OWASP Top 10
- Practice: Harden your cluster
4. Build Real Projects
Project Ideas:
- Personal Blog Platform
- Deploy WordPress on Kubernetes
- Use MySQL database
- SSL with cert-manager
- GitOps with ArgoCD
- Microservices Application
- Multiple services (frontend, backend, database)
- Service mesh (Istio/Linkerd)
- Distributed tracing
- CI/CD pipeline
- Monitoring Stack
- Prometheus for metrics
- Grafana for dashboards
- AlertManager for alerts
- Loki for logs
- Multi-Region Deployment
- Deploy to multiple AWS regions
- Use Route53 for DNS failover
- Database replication
- Disaster recovery plan
5. Get Certified
Recommended Certifications:
- AWS Certified Solutions Architect
- Validates AWS knowledge
- Industry recognized
- Good for career growth
- Certified Kubernetes Administrator (CKA)
- Hands-on exam
- Proves K8s operational skills
- Highly valued
- Terraform Associate
- Official HashiCorp certification
- Validates IaC skills
- Growing demand
6. Join the Community
Resources:
- Reddit: r/kubernetes, r/devops, r/terraform
- Discord: Kubernetes, CNCF, DevOps
- Twitter: Follow #kubernetes, #devops, #terraform
- Meetups: Local DevOps/Kubernetes meetups
- Conferences: KubeCon, HashiConf, DevOpsDays
7. Keep a Learning Log
Document your journey:
# DevOps Learning Log
## 2024-01-15: Fixed Worker Node Issue
- Problem: Nodes stuck in pending
- Root cause: Missing IAM permissions
- Lesson: Always verify IAM before deploying
- Next: Study IAM policies in depth
## 2024-01-16: Deployed First App with ArgoCD
- Created simple nginx deployment
- Set up ArgoCD sync
- Watched automatic deployment
- Next: Deploy multi-tier app
## 2024-01-17: Implemented Monitoring
- Installed Prometheus
- Created Grafana dashboards
- Set up alerts
- Next: Add distributed tracing
Conclusion
What You've Learned
In this journey, you've learned:
- DevOps Fundamentals
- Automation, CI/CD, IaC
- Collaboration between dev and ops
- Version control for everything
- Infrastructure as Code
- Terraform basics and advanced concepts
- Modules, state, and providers
- Multi-environment management
- Kubernetes
- Architecture and components
- Pods, Deployments, Services
- Self-managed vs managed clusters
- GitOps
- ArgoCD and continuous deployment
- Declarative configuration
- Self-healing and automated sync
- Real-World Troubleshooting
- IAM permissions debugging
- Node scheduling issues
- CRD size limitations
- Bootstrap order dependencies
The DevOps Mindset
Remember these principles:
- Automate Everything
- If you do it twice, automate it
- Manual processes are error-prone
- Automation enables scale
- Version Control Everything
- Code, infrastructure, configs
- Git is the source of truth
- Audit trail for all changes
- Fail Fast, Learn Fast
- Test in dev/staging first
- Failures are learning opportunities
- Iterate quickly
- Measure Everything
- Metrics, logs, traces
- You can't improve what you don't measure
- Data-driven decisions
- Never Stop Learning
- Technology evolves rapidly
- Always be curious
- Share knowledge with others
Your Infrastructure
Your current setup is production-ready with:
- ✓ Multi-environment Terraform (dev/staging/prod)
- ✓ Self-managed Kubernetes cluster
- ✓ GitOps with ArgoCD
- ✓ CI/CD with GitHub Actions
- ✓ Security best practices (IAM, OIDC, encryption)
- ✓ Comprehensive documentation
Next steps:
- Fix the worker node issue (follow the guide above)
- Deploy your first application
- Expand to staging and production
- Add monitoring and logging
- Keep learning and improving
Final Thoughts
DevOps is a journey, not a destination. You'll encounter problems, make mistakes, and learn continuously. That's normal and healthy.
What matters:
- Keep building
- Keep learning
- Keep improving
- Keep sharing
Welcome to the world of DevOps! 🚀
Additional Resources
Books
- "The Phoenix Project" by Gene Kim - DevOps novel
- "Kubernetes Up & Running" by Kelsey Hightower
- "Terraform: Up & Running" by Yevgeniy Brikman
- "Site Reliability Engineering" by Google
Online Courses
- A Cloud Guru - AWS, Kubernetes, Terraform
- Linux Academy - DevOps learning paths
- Udemy - Specific technology courses
- Coursera - University-level courses
Hands-On Labs
- Katacoda (now O'Reilly) - Interactive scenarios
- Play with Kubernetes - Free K8s playground
- AWS Free Tier - Practice on real cloud
- KillerCoda - K8s and cloud labs
Tools to Explore
- kubectl - K8s CLI
- helm - K8s package manager
- k9s - Terminal UI for K8s
- lens - K8s IDE
- terraform - IaC tool
- terragrunt - Terraform wrapper
- docker - Containerization
- git - Version control
Remember: Every expert was once a beginner. Keep learning, keep building, and don't be afraid to break things (in dev environment)!
Happy learning! 🎓