Replicating the Front Door and the Basement
Introducing redundant ingress and storage after the cluster has learned to stay boring
This article explains how to introduce redundancy for ingress and storage only after a Kubernetes cluster has proven it can stay calm under change. By focusing on boredom rather than maximal availability, it shows how to remove accidental single points of failure without pretending to have solved problems you haven't.
Once a Kubernetes cluster can survive the addition and removal of nodes without drama, it crosses an important psychological boundary. You stop treating machines as special. You stop narrating every change. The system begins to feel like infrastructure rather than a project.
That’s when the next illusion tends to creep in: the belief that the system is now redundant.
It isn’t. Not yet.
What you have at this point is a cluster that can coordinate work across multiple machines. What you do not yet have is a system that can tolerate the loss of the two things that matter most: the way traffic gets in, and the place data lives. In other words, the front door and the basement.
This article is about making those boring — carefully, incrementally, and without pretending you’ve solved problems you haven’t.
Ingress is usually the first place redundancy gets misunderstood.
By default, most clusters begin with a single ingress controller running on a single node. It may be managed by Kubernetes, it may restart automatically, and it may even feel robust. But if that node disappears, traffic stops. The cluster can be perfectly healthy internally while being effectively unreachable.
The temptation at this stage is to introduce an external load balancer and declare victory. That can work, but it also moves the problem somewhere else. Before doing that, it’s worth exhausting what Kubernetes itself already gives you.
The simplest and most honest first step is to run more than one ingress controller replica and let those replicas land on different nodes. This is not about clever routing. It’s about physics. Two processes on two machines fail differently than one process on one machine. When one node reboots, the other keeps listening. When Kubernetes reschedules, traffic continues to flow.
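As a concrete sketch, this is roughly what that looks like for an nginx-based ingress Deployment. The names, labels, and image tag are illustrative rather than any chart's defaults; the load-bearing part is the pod anti-affinity rule:

```yaml
# Sketch: two ingress controller replicas, forced onto different nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ingress-nginx
  template:
    metadata:
      labels:
        app: ingress-nginx
    spec:
      affinity:
        podAntiAffinity:
          # requiredDuringScheduling guarantees the replicas land on
          # different nodes; prefer preferred* rules if nodes are scarce.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: ingress-nginx
              topologyKey: kubernetes.io/hostname
      containers:
        - name: controller
          image: registry.k8s.io/ingress-nginx/controller:v1.10.0
          ports:
            - containerPort: 80
            - containerPort: 443
```

The `topologyKey: kubernetes.io/hostname` line is what encodes the physics: one replica per hostname, so one machine rebooting cannot take both down.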
At this stage, ingress redundancy is internal. Requests still arrive at a single IP address, typically bound to one machine or one network interface. That’s fine. The goal is not perfect availability. The goal is to remove accidental single points of failure before introducing intentional ones.
Once multiple ingress replicas are running calmly, you turn your attention outward. Now the question becomes: how does traffic reach the cluster if the node currently holding the ingress IP disappears? This is where approaches diverge based on environment.
In some on-prem setups, a simple virtual IP managed by keepalived-style tooling is enough. In others, a small hardware or software load balancer sits in front of the cluster and forwards traffic to whichever node is healthy. The important part is not the product choice; it’s the boundary. You want ingress to fail over rather than fail closed.
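As an illustration of the virtual-IP approach, a minimal keepalived configuration for a two-node failover pair might look like this. The interface name, password, and addresses are placeholders, and the backup node would carry `state BACKUP` with a lower priority:

```
# /etc/keepalived/keepalived.conf on the primary node (sketch).
vrrp_instance ingress_vip {
    state MASTER
    interface eth0            # placeholder: your actual NIC
    virtual_router_id 51
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme    # placeholder secret
    }
    virtual_ipaddress {
        192.0.2.10/24         # the ingress VIP clients connect to
    }
}
```

If the machine holding the VIP disappears, VRRP moves the address to the backup node, which satisfies the plain-language test: traffic comes in here, and if that machine goes away, it goes over there.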
At no point should ingress start behaving mysteriously. You should always be able to say, in plain language, “traffic comes in here, and if that machine goes away, it goes over there.”
If you can’t say that, you’ve introduced abstraction faster than understanding.
Storage is harder, and pretending otherwise is how people lose data.
Up to this point, your workloads may still be relying on node-local storage. That’s acceptable early on. It’s even healthy. It forces you to see clearly which workloads are actually stateful and which ones only feel that way.
The moment you add a second node, Kubernetes stops promising that a pod will come back on the same machine. If the storage it depends on is tied to a specific node, Kubernetes can restart the pod somewhere else — and the data simply won’t be there.
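This coupling is visible right in the API: a local PersistentVolume carries node affinity as part of its spec. The node name and path below are examples:

```yaml
# A local PersistentVolume is pinned to one node by definition.
# If node-a goes away, a pod bound to this volume cannot start elsewhere.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-data
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1          # example path on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-a           # the data lives here, and only here
```

The `nodeAffinity` block is not an accident; it is Kubernetes stating plainly that this volume and this machine are the same failure domain.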
That’s not a Kubernetes bug. It’s Kubernetes telling the truth.
The first step toward replicated storage is not replication. It’s admission. You explicitly decide which data must survive node loss and which data can be rebuilt. Caches, scratch space, and derived artifacts often don’t need replication at all. Databases and user uploads almost always do.
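One way to make that admission explicit is in the PersistentVolumeClaims themselves, by naming a storage class per durability tier. The class names here are hypothetical:

```yaml
# Cache data: node-local, rebuildable, no replication needed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: render-cache
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: local-scratch      # hypothetical non-replicated class
  resources:
    requests:
      storage: 10Gi
---
# User uploads: must survive node loss, so use the replicated class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: user-uploads
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: replicated-ssd     # hypothetical replicated class
  resources:
    requests:
      storage: 50Gi
```

Writing the decision into the claim means the distinction survives team turnover: anyone reading the manifest can see which data was judged rebuildable and which was not.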
Only after that distinction is clear do you introduce shared storage.
In on-prem environments, this often starts with a deliberately modest choice: a small distributed storage system running inside the cluster, rather than an external enterprise appliance. Systems like this trade raw performance for simplicity and clarity, which is exactly what you want early on.
The goal here is not infinite scalability. It’s to ensure that when a node disappears, the data does not disappear with it.
At first, replication factors are small. Two copies instead of one. You lose a disk, not a dataset. You reboot a machine, not a business. When something breaks — and something will — you understand what broke and why.
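With a system like Longhorn, for example, the replica count is just a StorageClass parameter, so "two copies" is a one-line decision. Longhorn is one option among several, and the parameter names below are Longhorn-specific:

```yaml
# StorageClass with two replicas: losing one disk loses one copy, not the data.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: replicated-2
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"        # Longhorn-specific: copies per volume
  staleReplicaTimeout: "2880"  # minutes before a failed replica is cleaned up
allowVolumeExpansion: true
reclaimPolicy: Retain
```

Keeping the factor at two while you learn is deliberate: it buys survival of a single failure without hiding rebuild and resync behavior behind a third copy you don't yet understand.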
This phase teaches a critical lesson: storage failures are slower and quieter than compute failures. They don’t announce themselves with crashes. They show up as latency, partial reads, or subtle corruption if you move too fast. That’s why you keep the system small and boring while learning.
There is an important asymmetry between ingress and storage that’s easy to miss.
Ingress redundancy is about continuity. Storage redundancy is about truth.
If ingress fails, users are annoyed. If storage fails, reality forks.
That’s why you replicate ingress first. You let the system learn how to route around failure. Only then do you teach it how to preserve state across failure. Reversing that order leads to brittle systems that keep data safe but can’t be reached — or worse, systems that are reachable but quietly losing consistency.
As storage becomes replicated, you resist the urge to declare success. Replication does not mean correctness. It means you’ve bought time. The real work is learning how your system behaves when disks fall behind, when nodes rejoin, and when replication lags.
Again, boredom is the signal. A storage system that makes you nervous is not ready to be trusted.
By the time both ingress and storage are replicated calmly, something subtle has changed. You stop thinking in terms of “this machine hosts that thing.” Instead, you start thinking in terms of capabilities the cluster provides. Traffic ingress is no longer a place; it’s a function. Storage is no longer a disk; it’s a promise with known failure modes.
This is the moment when higher-level conversations become safe.
Hybrid cloud. Burst capacity. Disaster recovery. None of these are architectural leaps anymore. They’re extensions of patterns you’ve already practiced locally. The cloud stops feeling like a different world and starts feeling like another failure domain — one you can choose to involve or ignore.
But that only works because you didn’t rush here.
You taught the cluster to stay boring first. Then you taught the front door not to slam shut when a machine disappears. Then you taught the basement not to collapse when a disk goes dark.
Everything after that is just repetition at a larger radius.