Infrastructure · April 3, 2026 · 7 min read

What self-healing infrastructure actually means

Every cloud provider and DevOps vendor uses the term "self-healing." Most of the time, it means "we restart crashed containers." That is the bare minimum, not self-healing. Real self-healing infrastructure is a layered set of engineering patterns that detect failures, isolate them, recover automatically, and prevent recurrence - all without a human touching a keyboard. Here is what each layer looks like and how it works.

Layer 1: Health probes and automatic restart

The foundation is knowing when something is broken. Kubernetes provides two probe types that serve different purposes, and getting the distinction right matters.

A liveness probe answers one question: is this process fundamentally broken? It hits an endpoint (typically /healthz) and expects a 200 response. If the probe fails three consecutive times (configurable via failureThreshold), the kubelet kills the container and restarts it. This catches deadlocked processes, memory leaks that have corrupted state, and unrecoverable runtime errors. The key design decision: the liveness endpoint should check only whether the process itself is functional, not its dependencies. If your liveness probe checks the database and the database is down, Kubernetes will restart your perfectly healthy application in a loop, making things worse.

A readiness probe answers a different question: can this instance handle traffic right now? It hits a separate endpoint (typically /ready) that checks whether the service can actually serve requests - database connection is live, cache is warm, required config is loaded. When a readiness probe fails, the pod is removed from the Service's endpoint list. No traffic is routed to it. But the pod is not restarted. It stays running and keeps being probed. When it recovers - the database reconnects, the cache repopulates - it passes the readiness check and traffic resumes automatically.
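The two endpoints can be sketched as a single handler. This is an illustrative Python sketch using the standard library, not a production server: the `dependencies_ready` dict stands in for real dependency checks (a database ping, a cache warm-up flag), and the endpoint paths mirror the `/healthz` and `/ready` conventions above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical dependency state; a real service would perform live checks
# (database ping, cache warmed, config loaded) instead of reading a dict.
dependencies_ready = {"database": True, "cache": True}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: only asserts the process itself can serve a request.
            # Deliberately does NOT check external dependencies, so a dead
            # database never triggers a restart loop of a healthy app.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        elif self.path == "/ready":
            # Readiness: fails when any dependency is unavailable, which
            # removes the pod from the Service endpoints without restarting it.
            if all(dependencies_ready.values()):
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"ready")
            else:
                self.send_response(503)
                self.end_headers()
                self.wfile.write(b"not ready")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the sketch quiet; a real service logs requests properly
```

When the database flag flips to `False`, `/ready` starts returning 503 while `/healthz` keeps returning 200: Kubernetes stops routing traffic but leaves the process alone to recover.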

This distinction is the first layer of self-healing: broken processes get killed and restarted; temporarily unhealthy instances get removed from rotation and re-added when they recover. No human involved.

Layer 2: Circuit breakers

Health probes handle failures within a service. Circuit breakers handle failures between services. Without them, a single slow or failing upstream service drags down everything that depends on it.

The pattern works like an electrical circuit breaker. In the closed state (normal operation), requests flow through. The circuit breaker tracks failure rates - timeouts, 5xx responses, connection errors. When the failure rate exceeds a threshold (say, 50% of requests in the last 30 seconds), the breaker trips open. All subsequent requests to that upstream are immediately rejected with a fallback response - a cached value, a degraded response, or a fast error. No more requests accumulate against the failing service.

After a configured timeout (e.g., 30 seconds), the breaker enters a half-open state and allows a single probe request through. If it succeeds, the breaker closes and normal traffic resumes. If it fails, the breaker opens again and the timeout resets.
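The closed/open/half-open state machine fits in a few dozen lines. This is a minimal sketch of the pattern, not any particular library's implementation; the window size, failure threshold, and timeout are illustrative defaults.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: closed -> open -> half-open -> closed.

    Thresholds are illustrative, not taken from any particular library.
    """
    def __init__(self, failure_threshold=0.5, window=10,
                 open_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # trip above this failure rate
        self.window = window                        # number of recent calls tracked
        self.open_timeout = open_timeout            # seconds before half-open probe
        self.clock = clock
        self.results = []                           # True = success, False = failure
        self.state = "closed"
        self.opened_at = None

    def call(self, fn, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.open_timeout:
                self.state = "half-open"            # allow one probe request through
            else:
                return fallback()                   # fail fast; no load on the upstream
        try:
            result = fn()
        except Exception:
            self._record(False)
            return fallback()
        self._record(True)
        return result

    def _record(self, ok):
        if self.state == "half-open":
            # A single probe decides: success closes, failure reopens.
            if ok:
                self.state, self.results = "closed", []
            else:
                self.state, self.opened_at = "open", self.clock()
            return
        self.results = (self.results + [ok])[-self.window:]
        failures = self.results.count(False)
        if (len(self.results) >= self.window
                and failures / len(self.results) > self.failure_threshold):
            self.state, self.opened_at = "open", self.clock()
```

Usage is `breaker.call(lambda: fetch_upstream(), lambda: cached_value)`: while the breaker is open, callers get the fallback immediately instead of stacking requests against a failing dependency.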

In a service mesh like Istio, circuit breakers are configured at the infrastructure level via DestinationRule resources. You define outlierDetection with parameters like consecutive errors before ejection, ejection duration, and the percentage of hosts that can be ejected simultaneously. The application code doesn't need to know circuit breakers exist - the Envoy sidecar proxy handles it transparently.

This is the second layer: when a dependency fails, the system isolates the failure instead of propagating it. Services degrade gracefully. When the dependency recovers, traffic resumes. No human involved.

Layer 3: Auto-scaling on real metrics

CPU-based auto-scaling is the default, and it's often wrong. A Node.js service might be at 20% CPU but completely saturated because the event loop is blocked. A Rust service might be at 90% CPU but handling requests perfectly because it's doing genuine computational work. CPU alone doesn't tell you whether the service needs more replicas.

Effective auto-scaling uses application-level metrics exposed through Prometheus and consumed by KEDA (Kubernetes Event-Driven Autoscaler) or by the HPA (Horizontal Pod Autoscaler) via a custom-metrics adapter. The metrics that actually matter:

  • Request queue depth. How many requests are waiting to be processed. If this number is growing, you need more capacity.
  • P99 response latency. When latency crosses a threshold, scale out before users notice. A target of "keep p99 under 200ms" is a more meaningful scaling signal than "keep CPU under 70%."
  • Consumer lag (for queue-based services). If your Kafka consumer lag is growing, you need more consumers. KEDA can scale pods based on consumer group lag directly.
  • Active connection count. For WebSocket services or long-lived connections, the number of active connections is a better capacity signal than CPU or memory.

Scaling policies should include both scale-up and scale-down behaviour. Scale up aggressively (add 3 replicas immediately when the threshold is breached), scale down conservatively (remove 1 replica every 5 minutes after the metric stabilises). This asymmetry prevents flapping - rapid scale-up/scale-down cycles that destabilise the cluster.
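The asymmetry can be captured as a small decision function. This is a sketch of the policy shape, not a real autoscaler: the targets, step size, and cooldown are the illustrative numbers from above, and `p99_latency_ms` and `queue_depth` stand in for values pulled from Prometheus.

```python
def desired_replicas(current, p99_latency_ms, queue_depth,
                     last_scale_down_ts, now,
                     latency_target_ms=200, queue_target=100,
                     scale_up_step=3, scale_down_cooldown=300,
                     min_replicas=2, max_replicas=50):
    """Asymmetric scaling sketch: jump up fast, step down slowly.

    Returns (new_replica_count, new_last_scale_down_ts). All thresholds
    are illustrative defaults, not prescriptions.
    """
    # Scale up immediately when either signal breaches its target.
    if p99_latency_ms > latency_target_ms or queue_depth > queue_target:
        return min(current + scale_up_step, max_replicas), last_scale_down_ts
    # Scale down by one replica at most once per cooldown window, and only
    # when both signals sit comfortably below target. The asymmetry is what
    # prevents flapping.
    healthy = (p99_latency_ms < 0.5 * latency_target_ms
               and queue_depth < 0.5 * queue_target)
    if healthy and now - last_scale_down_ts >= scale_down_cooldown:
        return max(current - 1, min_replicas), now
    return current, last_scale_down_ts
```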

Layer 4: Database failover

A self-healing application layer is useless if the database is a single point of failure. Automated database failover requires a primary-replica topology with automatic promotion.

In PostgreSQL, this is typically handled by Patroni or the CloudNativePG operator in Kubernetes. The pattern: a primary node handles writes and streams WAL (Write-Ahead Log) records to one or more synchronous replicas. A consensus mechanism (etcd, Consul, or Kubernetes leader election) monitors the primary. If the primary stops responding - network partition, hardware failure, OOM kill - the consensus mechanism promotes the most up-to-date replica to primary within seconds. The connection pooler (PgBouncer or Pgpool) updates its routing to point at the new primary. Applications reconnect through the pooler and continue operating.

The critical detail is synchronous vs asynchronous replication. Synchronous replication guarantees zero data loss on failover (the replica has every committed transaction) but adds latency to writes because the primary waits for the replica to confirm. Asynchronous replication is faster but can lose the most recent transactions during failover. The choice depends on whether you can tolerate any data loss. For financial systems, synchronous. For session stores, asynchronous is usually fine.
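The core of the promotion decision is comparing how much WAL each replica has replayed. PostgreSQL expresses replication positions as LSNs (log sequence numbers) like `0/3000060`, which compare as integers once decoded. The sketch below shows the selection step only; it is a simplification of what Patroni-style leader election does, and the node names are hypothetical.

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN like '0/3000060' to a comparable integer.

    The LSN is two hex-encoded 32-bit halves separated by a slash.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def promotion_candidate(replicas):
    """Pick the replica that has replayed the most WAL.

    `replicas` maps node name -> last replayed LSN string. A real failover
    system also checks node health and quorum before promoting; this sketch
    covers only the "most up-to-date replica" selection.
    """
    return max(replicas, key=lambda name: lsn_to_int(replicas[name]))
```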

Layer 5: Deployment rollback on error rate spike

A deployment that increases the 5xx error rate from 0.1% to 5% should not stay running. Progressive delivery tools like Argo Rollouts and Flagger automate this decision.

The mechanism: instead of replacing all pods at once, the new version is deployed as a canary receiving a small percentage of traffic (typically 5-10%). The rollout controller queries Prometheus for error rate and latency metrics over a configurable analysis window (e.g., 5 minutes). If the canary's error rate exceeds the defined threshold, the rollout is automatically aborted and all traffic is routed back to the stable version. If the canary passes, traffic is shifted incrementally - 10%, 25%, 50%, 100% - with analysis at each step.

This means a bad deployment is detected and rolled back in minutes, not hours. The blast radius is limited to the canary percentage. No human needs to watch dashboards during deployment.

Layer 6: Infrastructure drift detection

Infrastructure defined in Terraform or Pulumi represents the intended state. The actual state drifts. Someone manually changes a security group rule. An auto-applied cloud provider update modifies a managed resource's configuration. A team member runs kubectl apply directly instead of going through the pipeline.

Drift detection runs a scheduled reconciliation loop. Pulumi's pulumi refresh or Terraform's terraform plan -detailed-exitcode compares the live state against the declared state. If drift is detected, the system can either alert (conservative approach) or automatically re-apply the declared state (aggressive approach). For security-critical resources like IAM policies, firewall rules, and encryption settings, automatic remediation is appropriate - you don't want a manually opened port to stay open until someone notices.

GitOps controllers like ArgoCD take this further by continuously reconciling the cluster state against the Git repository. Any manual change to the cluster is automatically reverted to match the declared state in Git. The Git repository becomes the single source of truth, and the infrastructure heals itself back to the intended configuration.

Layer 7: Alert-driven automated remediation

The final layer connects monitoring to action. An alert fires, and instead of paging a human, it triggers an automated runbook.

Example pipeline: Prometheus detects that disk usage on a persistent volume exceeds 85%. It fires an alert to Alertmanager. Alertmanager routes the alert to a webhook that triggers a Kubernetes Job. The Job runs a script that expands the PVC (Persistent Volume Claim) by 20%, verifies the filesystem resized correctly, and posts a summary to Slack. Total time: 90 seconds. Without automation, this is a 15-minute task assuming someone is awake and available.

More sophisticated pipelines handle certificate rotation (renew 30 days before expiry via cert-manager), log volume spikes (automatically increase log retention storage or adjust sampling rates), and node failures (cordon the node, drain workloads, provision a replacement through the cluster autoscaler).

The key constraint: automated remediation must be idempotent and bounded. Running the same remediation twice should not cause harm. And every automated action should have a circuit breaker of its own - if the remediation has fired 3 times in an hour for the same issue, stop and page a human. Infinite automated loops are worse than the original problem.
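That bound can be enforced with a small guard in front of every runbook. This is a sketch of the "3 fires per hour, then page a human" rule described above; the limits are illustrative and `remediate`/`page_human` are hypothetical callables supplied by the pipeline.

```python
import time

class RemediationGuard:
    """Bound automated remediation: allow at most `max_fires` runs per
    `window` seconds for the same issue, then escalate to a human.
    """
    def __init__(self, max_fires=3, window=3600.0, clock=time.monotonic):
        self.max_fires = max_fires
        self.window = window
        self.clock = clock
        self.fires = {}  # issue key -> timestamps of recent remediation runs

    def attempt(self, issue, remediate, page_human):
        now = self.clock()
        recent = [t for t in self.fires.get(issue, []) if now - t < self.window]
        if len(recent) >= self.max_fires:
            page_human(issue)   # break the loop; a human takes over
            return "escalated"
        self.fires[issue] = recent + [now]
        remediate(issue)        # the runbook itself must be idempotent
        return "remediated"
```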

This is what "fully automated" means

Self-healing infrastructure is not a product you buy. It is an engineering discipline applied across every layer of the stack: process health, service-to-service resilience, capacity management, data layer redundancy, deployment safety, configuration integrity, and operational response.

Each pattern is well-understood and individually straightforward. The engineering challenge is composing them into a system where failures are detected in seconds, isolated immediately, and resolved automatically - with humans notified after the fact rather than during the crisis.

That is not marketing. That is engineering.

Kaev builds infrastructure that recovers from failure without human intervention. If your systems still rely on someone being awake at 3am, let's fix that.
