As your organization gets more comfortable with Kubernetes in development, you’ll want to prepare to adopt it in production. But mastering Kubernetes in dev does not necessarily translate into mastering it in prod. There are many additional components that must be configured and fine-tuned to ensure reliable, self-healing production clusters.

In this blog, we’ll walk through the key elements of a Kubernetes production setup. In follow-up blogs, we’ll dive deeper into each of these considerations.

Reliability and self-healing

If you are like most companies, your team can’t afford to handle downtime issues with your virtual or physical machines or Kubernetes components manually. If something goes down, perhaps due to a kernel deadlock, disk corruption, or the unexpected impact a new application has on the system, it’s critical that you get it back up and running as soon as possible, preferably automatically and without manual intervention.
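To make this concrete, here is a minimal sketch of workload-level self-healing: a Deployment with multiple replicas and a liveness probe. The app name, image, and probe endpoint are placeholders. If a node fails, the pods are rescheduled onto healthy nodes; if the process hangs, the probe restarts the container.

```yaml
# Illustrative only: a Deployment that Kubernetes keeps healthy on its own.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # hypothetical app name
spec:
  replicas: 3                      # survive the loss of a single pod or node
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example.com/web:1.0   # placeholder image
          livenessProbe:               # restart the container if it stops responding
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
```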

This is particularly true if you’re looking to guarantee an SLA: humans simply can’t react fast enough. Likewise, end users no longer tolerate delays or service disruptions. If your service is down, they may consider switching to a competitor. Today, more than ever, reliability is a requirement, not a differentiator.

Let’s stay with the SLA example as it is top of mind for many of our users. To guarantee it, you have two options:

  1. Select a provider who will manage your Kubernetes deployments on your behalf, known as managed or cloud-hosted Kubernetes.
  2. Set up Kubernetes in your own infrastructure so that it can recover automatically, known as self-managed Kubernetes.

Whether you select managed or self-managed Kubernetes largely depends on your requirements and how much control you need. The key difference lies in the masters. While you are responsible for managing your worker nodes (after all, that’s where the actual application runs), with managed Kubernetes offerings the provider controls your masters. You can’t alter their configuration; you’re stuck with what the provider has predetermined. That’s fine for many use cases, but if you want to customize your cluster in a way the managed provider does not support, such as enabling alpha features, managed Kubernetes is not an option. For more on reliable, self-healing Kubernetes clusters, read this blog.

A Nice UI is Great, But…

There is a lot of talk about the user interface (UI). A user-friendly UI is great, but you won’t use it in production; at least, you shouldn’t if you follow DevOps best practices.

Anything that runs in production should be codified in version control. Kubernetes and infrastructure configurations, as well as application manifests, should be controlled through a DevOps pipeline or an established GitOps process, not manually through the UI. Altering Kubernetes objects, application manifests, and other service components through kubectl or the UI is not recommended.
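In practice, that means your manifests live in a Git repository and a pipeline or controller reconciles the cluster against them. Here is a minimal sketch, assuming a GitOps controller such as Argo CD; the repository URL, paths, and names are placeholders.

```yaml
# Illustrative Argo CD Application: the cluster state is continuously synced
# from a Git repository rather than changed by hand.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/k8s-manifests.git   # placeholder repo
    targetRevision: main
    path: apps/web
  destination:
    server: https://kubernetes.default.svc
    namespace: web
  syncPolicy:
    automated:
      prune: true        # remove resources deleted from Git
      selfHeal: true     # revert manual drift introduced via kubectl or the UI
```

With selfHeal enabled, any change made outside of Git is rolled back to whatever the repository declares.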

In short, while you’ll use all the UI configuration options in development, you’ll script or automate everything in production. At that point, the UI becomes an afterthought.

Security

While security components may be optional in development, they certainly aren’t in production. Whether it’s TLS, keys, certificates and secrets in general, pod security and network policies, or zero-trust configuration, you’ll need to ensure they are all properly set up and configured; each is essential in a production environment.
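As a starting point, here are two illustrative snippets; the namespace and image are hypothetical. The first denies all ingress traffic in a namespace by default, the second hardens a container’s security context.

```yaml
# Illustrative default-deny ingress policy for a namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production        # hypothetical namespace
spec:
  podSelector: {}              # selects every pod in the namespace
  policyTypes:
    - Ingress
---
# Illustrative hardened container security context.
apiVersion: v1
kind: Pod
metadata:
  name: web                    # hypothetical pod
spec:
  containers:
    - name: web
      image: example.com/web:1.0   # placeholder image
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
```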

And then there is your team. Enterprises must manage the roles and access of potentially large teams. While in most dev clusters you simply create cluster-admin roles and broadly provide access to the application or even the entire cluster, you can’t do that in production. Only those who really need access should have access to a production cluster, and if they only need to view it, they should get read-only access. You need fine-grained control to properly manage permissions, plus a mechanism that tracks who has access to what. Ideally, you’ll manage it all through a tool such as Active Directory or your identity management system connected to your Kubernetes cluster.
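For example, read-only access to a single namespace could look like the sketch below; the namespace and group name are assumptions, with the group typically coming from your identity provider.

```yaml
# Illustrative read-only role for a production namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-only
  namespace: production             # hypothetical namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
---
# Bind the role to a group managed in your identity provider.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-only-binding
  namespace: production
subjects:
  - kind: Group
    name: prod-viewers              # hypothetical AD/OIDC group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: read-only
  apiGroup: rbac.authorization.k8s.io
```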

Audit is another important component of security. Any change, no matter how small, should be logged for anomaly and intrusion detection, regulatory requirements, and troubleshooting.
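On self-managed clusters, this is typically configured through an API server audit policy (passed via the --audit-policy-file flag); managed offerings expose audit logs through their own tooling. A minimal, illustrative policy:

```yaml
# Illustrative audit policy; rules are evaluated top to bottom, first match wins.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Record who touched Secrets and ConfigMaps, but keep payloads out of the log.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Record full request and response bodies for anything that mutates state.
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
  # Ignore everything else.
  - level: None
```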

Last but not least, you must manage your system’s operational data, specifically collected logs and possibly metrics (although metrics are usually less sensitive).

Ops and management

Once your Kubernetes production clusters are up and running, you’ll need to set up appropriate logging and monitoring tools to keep control of all of them. This will ensure you can identify potential issues early on, ideally before your customers are impacted.

And, because you can’t rely on having eyes glued to a dashboard 24/7, you need a smart alerting system that triggers the right alert for each scenario. Smart is an important property for an alerting system: failures often cascade, and alerting systems can quickly overwhelm operators if they can’t distinguish low-level from high-level failures and escalate alerts correctly.

Smart means alerting operators at high priority only if the system cannot recover automatically. Even when your system does recover on its own, you still want a lower-priority notification so you can find out what happened and why. The right dashboards with the right data will also help you predict a potential disaster and take corrective action, rather than simply reacting to it.
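What this can look like in practice, assuming a Prometheus/Alertmanager-based stack: page operators only when pods keep crash-looping, that is, when self-healing has clearly not resolved the problem. The metric comes from kube-state-metrics; the threshold and severity label are illustrative.

```yaml
# Illustrative Prometheus alerting rule.
groups:
  - name: workload-health
    rules:
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 15m                  # sustained restarts, not a one-off blip
        labels:
          severity: page          # escalate: automatic recovery has not worked
        annotations:
          summary: "Container in {{ $labels.namespace }}/{{ $labels.pod }} keeps restarting"
```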

Governance Requirements

In development, there is no right and wrong when it comes to governance. Programmers often run applications from any repository, and there are no resource limits. In production, however, any deviation from governance can translate into a service disruption or a weak security link. That’s why you need to enforce everything: which images may run, resource limits for each app to avoid overloading nodes and to ensure proper scaling, and network policies that isolate communication between apps.
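Resource limits, for instance, can be enforced per namespace rather than trusting every team to set them. A sketch using a LimitRange and a ResourceQuota; the namespace and numbers are placeholders.

```yaml
# Illustrative guardrails: every container gets default requests/limits,
# and the namespace as a whole cannot exceed a fixed budget.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: production          # hypothetical namespace
spec:
  limits:
    - type: Container
      default:                   # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-budget
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
```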

To use a cooking analogy: after you have written down your secret recipe (version control) and followed it when preparing the dish (the “everything as code” approach), proper governance ensures that your ingredients and kitchen tools are safe and in proper working order.

In short, you need a system that supports the enforcement of these policies. If you have an easy-to-use development environment, it should allow you to script those policies in QA and add them to your production automation pipeline, enforcing them on all clusters created down the line.

Scalable Kubernetes Clusters

We know Kubernetes is great for scaling, but the cluster isn’t the only piece that needs to scale for your apps to scale. Your database, storage, and any external services used by your applications must also scale. Whichever technologies you use for your cloud native apps, ensure they are scalable, support dynamic loading, and can span regions.
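On the Kubernetes side, scaling is usually declared rather than done by hand, for example with a HorizontalPodAutoscaler. A sketch targeting the hypothetical web Deployment used in the earlier examples:

```yaml
# Illustrative autoscaler: add or remove replicas to keep average CPU near 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```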

There is a lot more to running Kubernetes in production than in development. Before you go live for the first time, make sure you take all these considerations into account. Otherwise, you may risk running into issues, possibly even service disruptions.

In the coming weeks, we’ll post blogs that go into each of these areas in more detail. Sign up for our newsletter in the footer, follow us on Twitter, or just check back for our next blog in our Kubernetes in prod series.