The Equifax breach, one of the largest in history, traced back to a single unpatched Apache Struts vulnerability on a public-facing web server. Patching is non-negotiable for security, yet it remains one of the most dreaded tasks for sysadmins. "If I run yum update, will my database restart? Will PHP break? Will the server even come back up?" This fear leads to "Patch Paralysis," leaving high-priority CVEs exposed for months.
The Strategy: Automation with Constraints
Manual patching is error-prone and does not scale. You cannot SSH into 50 servers and run updates by hand. You need intelligent automation.
- Unattended-Upgrades (Debian/Ubuntu): Configure this immediately. Set it to automatically install security updates only. Leave feature updates (which might break configs) for manual review, but let critical security fixes flow freely. A preview sketch follows this list.
- KernelCare / Canonical Livepatch: In high-availability environments, rebooting for a kernel update is costly. These services patch the running kernel in memory, so you stay protected against kernel exploits like "Dirty COW" without a reboot or downtime.
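Before trusting the automation blindly, it helps to see what the security-only policy would actually install. Something like the cron-able wrapper below can surface that plan; it is a sketch only, and it assumes the Debian/Ubuntu `unattended-upgrade` binary is installed and that its `--dry-run` and `--debug` flags behave as documented.

```python
#!/usr/bin/env python3
"""Preview what unattended-upgrades would install, without applying anything.

A minimal sketch: assumes the `unattended-upgrade` binary from the
Debian/Ubuntu package and a 50unattended-upgrades config already limited
to the -security origin. Must run as root (it needs apt permissions).
"""
import subprocess
import sys


def preview_security_updates() -> str:
    # --dry-run resolves what would be upgraded without changing the system;
    # --debug prints the package decisions so we can log or mail them.
    result = subprocess.run(
        ["unattended-upgrade", "--dry-run", "--debug"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        sys.exit(f"unattended-upgrade failed: {result.stderr.strip()}")
    return result.stdout


if __name__ == "__main__":
    # Surface the plan to whatever alerting you already have (mail, chat, etc.).
    print(preview_security_updates())
```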
The Staging Buffer
Rule #1: Never patch production first. Create a "Patch Train":
- Dev/Staging: Updates apply automatically every night. If your staging environment is broken in the morning, you know the latest patch is toxic.
- The Canary: Identify one production server (or a small percentage of traffic). Apply the patch there first and monitor for 24 hours.
- The Fleet: Only after the Canary survives does the rest of the fleet update. Tools like Ansible or SaltStack can orchestrate this rollout precisely; a driver for this sequence is sketched after the list.
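A minimal driver for this Patch Train could look roughly like the following. The playbook name (`patch.yml`), the inventory groups (`canary`, `fleet`), the health-check endpoint, and the 24-hour soak window are all assumptions about your environment; the only real dependency is `ansible-playbook` and its `--limit` flag.

```python
#!/usr/bin/env python3
"""Canary-then-fleet rollout driver (sketch).

Patches the canary group first, watches a health endpoint for a soak
period, and only then rolls the rest of the fleet.
"""
import subprocess
import time
import urllib.request

SOAK_SECONDS = 24 * 3600                                 # canary soak period
HEALTH_URL = "http://canary01.internal:8080/healthz"     # hypothetical endpoint


def run_playbook(limit: str) -> None:
    # --limit restricts the play to one inventory group or host pattern.
    subprocess.run(["ansible-playbook", "patch.yml", "--limit", limit], check=True)


def healthy(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    run_playbook("canary")                     # 1. patch the canary only
    deadline = time.time() + SOAK_SECONDS
    while time.time() < deadline:              # 2. watch it for the soak period
        if not healthy(HEALTH_URL):
            raise SystemExit("Canary went unhealthy -- aborting fleet rollout")
        time.sleep(300)                        # poll every 5 minutes
    run_playbook("fleet:!canary")              # 3. roll the rest of the fleet
```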
The Modern Way: Immutable Infrastructure
The "Cloud Native" approach solves patching by removing the concept entirely. Never update a running server. Instead of patching Server A:
- Build a new machine image (AMI/Snapshot) with the latest OS and application code baked in.
- Spin up Server B using this new image.
- Run automated health checks.
- Update the Load Balancer to point to Server B.
- Terminate Server A.
This eliminates "configuration drift": you know exactly what is running because it was built from code in a clean environment, not on a server that has been hand-tweaked for three years.
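As a rough illustration, here is what that replace-instead-of-patch cycle can look like on AWS with boto3. The AMI ID, instance type, target group ARN, and old instance ID are placeholders, and a production version would also handle subnets, security groups, tagging, timeouts, and rollback.

```python
"""Immutable "replace, don't patch" rollout on AWS (sketch)."""
import time
import boto3

ec2 = boto3.client("ec2")
elbv2 = boto3.client("elbv2")

NEW_AMI = "ami-0123456789abcdef0"       # freshly baked image (placeholder)
TARGET_GROUP = "arn:aws:elasticloadbalancing:...:targetgroup/web/abc"  # placeholder
OLD_INSTANCE = "i-0aaaaaaaaaaaaaaaa"    # Server A, the box being replaced (placeholder)

# 1. Spin up Server B from the new image.
new_id = ec2.run_instances(
    ImageId=NEW_AMI, InstanceType="t3.small", MinCount=1, MaxCount=1
)["Instances"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[new_id])

# 2. Register it with the load balancer and wait for health checks to pass.
elbv2.register_targets(TargetGroupArn=TARGET_GROUP, Targets=[{"Id": new_id}])
while True:
    state = elbv2.describe_target_health(
        TargetGroupArn=TARGET_GROUP, Targets=[{"Id": new_id}]
    )["TargetHealthDescriptions"][0]["TargetHealth"]["State"]
    if state == "healthy":
        break
    time.sleep(15)

# 3. Drain and terminate Server A. Deregistering starts connection draining;
#    the sleep is a crude stand-in for the target group's deregistration delay.
elbv2.deregister_targets(TargetGroupArn=TARGET_GROUP, Targets=[{"Id": OLD_INSTANCE}])
time.sleep(300)
ec2.terminate_instances(InstanceIds=[OLD_INSTANCE])
```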
Managing Downtime: Grouping and Draining
If you must update existing servers in place (treating them as "pets" rather than "cattle"):
- Database Clusters: Patch the Replicas first. Promote a Replica to Primary. Then patch the old Primary.
- Web Clusters: Use a Load Balancer. Drain connections from Node 1 (wait for active requests to finish). Patch and reboot Node 1. Wait for health checks to pass. Re-enable Node 1. Move to Node 2. This "Rolling Update" strategy ensures end users never see a 503 Service Unavailable error. A sketch of the loop follows this list.
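One way that loop can look in practice is sketched below, assuming an HAProxy front end with its runtime admin socket enabled and nodes reachable over SSH. The backend name, node list, socket path, health-check URL, drain window, and the use of plain apt-get are all placeholders for your own setup, not a prescription.

```python
#!/usr/bin/env python3
"""Rolling update of a web cluster behind HAProxy (sketch)."""
import subprocess
import time
import urllib.request

HAPROXY_SOCK = "/run/haproxy/admin.sock"     # stats socket with admin level (assumed path)
BACKEND = "be_web"                           # hypothetical backend name
NODES = ["node1.internal", "node2.internal"]


def haproxy(cmd: str) -> None:
    # Send a command to HAProxy's runtime API via its admin socket.
    subprocess.run(["socat", "stdio", HAPROXY_SOCK],
                   input=cmd + "\n", text=True, check=True)


def healthy(node: str) -> bool:
    try:
        with urllib.request.urlopen(f"http://{node}:8080/healthz", timeout=5) as r:
            return r.status == 200
    except OSError:
        return False


for node in NODES:
    server = node.split(".")[0]                              # assumes LB server names match hostnames
    haproxy(f"set server {BACKEND}/{server} state drain")    # stop sending new traffic
    time.sleep(120)                                          # let in-flight requests finish

    # Patch and reboot the node while it is out of rotation.
    # check=False because the reboot drops the SSH connection mid-command.
    subprocess.run(["ssh", node,
                    "sudo apt-get update && sudo apt-get -y upgrade && sudo reboot"],
                   check=False)

    # Wait for the node to come back up and pass its health check.
    while not healthy(node):
        time.sleep(15)

    haproxy(f"set server {BACKEND}/{server} state ready")    # back into rotation
```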
