Cloud-native technologies are radically changing the IT landscape. Developed for the cloud, cloud-native is certainly not cloud-bound. Increasingly more companies seek to benefit from cloud-like features in their own data centers. While challenging to implement, it’s not impossible – even in highly restrictive and isolated environments.
Kubernetes can be challenging, no matter where you deploy. In addition to quarterly releases, Kubernetes cluster security is a big challenge for SRE and DevOps teams. Teams who migrate their workloads to highly available, redundant Kubernetes clusters will enjoy all the features but must also maintain a high level of security standards and keep all cluster components up to date.
If you intend to deploy in environments with additional infrastructure restrictions, such as no public internet access or corporate policies that demand high security standards, the installation and maintenance of Kubernetes become increasingly complex and costly in terms of labor hours. Infrastructure engineers must keep local mirrors of build dependency packages and Docker images up to date, scan every introduced open source component for vulnerabilities, use DMZ bastion hosts for external connections, and more.
While creating a cluster is easy with tools like kops, kubeadm, and Ansible playbooks, every project also has workloads to deploy, stress-test, optimize, secure, and audit, and clusters to maintain.
In this article, we’ll share some lessons learned and best practices from the Kublr team’s experience with financial services firms. We’ll review common requirements for secure or isolated environments, and how organizational processes, teams, and available existing infrastructure affect them. What does it mean to have a ‘production-grade cluster’? What role do managed solutions have and what are their limitations? We’ll also examine how teams are generally organized and how this affects requirements and the solution. Finally, we’ll dive into issues specific to on-premises installations.
Enterprise governance and security requirements
A good starting point for our discussion is system requirements for production clusters. They include:
- External and internal security. Protected communication between all cluster and application components. Information encrypted in transit and at rest. Authentication and authorization of all users and clients and enterprise-wide identity management.
- Audit of all operations with the cluster. This is a crucial step for larger organizations and must be implemented before deploying to production.
- Log collection, observability, and monitoring. You should be able to quickly identify issues and anomalies in your application’s behavior and be alerted to take corrective action. With logs and metrics in place, you’ll know whether you can meet your service level agreement (SLA), and what your service level objectives (SLO) and service level indicators (SLI) are.
- Support for isolated environments (that is, environments that aren’t connected to the internet). While not relevant for all organizations, this is a common requirement for financial services and other highly regulated organizations. Supporting these environments requires additional effort. Docker images and binaries must be delivered to the isolated environment. The installation method must allow easy modification of Docker image registries in the Kubernetes manifests, add-ons, and other deployment files. During the cluster setup, each simple task will have to be split into several steps, which can be a challenge for organizations that aren’t properly prepared. For instance, you’ll first need to download all required Docker images, binaries, Helm charts, and additional Kubernetes manifests. These artifacts will then have to be stored in a local mirror or in a Docker registry and artifact storage platform (e.g. Nexus, Artifactory, or Harbor). If some of these are not yet available inside the local network, they must be installed, becoming additional software for SRE or DevOps teams to maintain.
- Compatibility with existing CI/CD tooling. At the very least, the installation method should enable automation.
Additional Detail on Security Requirements
Security in the Kubernetes world can mean a lot. At the very least, you’ll want to integrate with LDAP or Active Directory to allow existing users to interact with the clusters. Manual account management is a path to disaster. Generally, you already have single sign-on (SSO) in your organization, so you’ll need a simple way to integrate with your existing solution using OpenID Connect or SAML.
Then, you’ll need to properly determine role-based access control requirements for clusters and dashboards. For user self-service capabilities, custom development may be required. Since each cluster has sensitive data like SSH private keys for provisioning, public cloud accounts, certificates for TLS and authentication, this data must be encrypted and access to it strictly controlled by per group or per user permissions.
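As a minimal sketch of per-group permissions, the Python below builds a namespaced read-only Role and a RoleBinding that grants it to a directory group. The namespace `payments`, the group `payments-developers`, the role name, and the helper function are hypothetical; the manifest fields follow the `rbac.authorization.k8s.io/v1` API, and how group names reach the cluster depends on your identity provider integration.

```python
import json


def read_only_binding(namespace, group):
    """Build a Role and RoleBinding giving `group` read-only access to pods."""
    role_name = "pod-reader"  # hypothetical role name
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-{group}", "namespace": namespace},
        "subjects": [
            {
                "kind": "Group",
                "name": group,
                "apiGroup": "rbac.authorization.k8s.io",
            }
        ],
        "roleRef": {
            "kind": "Role",
            "name": role_name,
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }
    return [role, binding]


if __name__ == "__main__":
    # Print both manifests as JSON for review before applying them.
    for manifest in read_only_binding("payments", "payments-developers"):
        print(json.dumps(manifest, indent=2))
```

Generating the binding per group rather than per user keeps account management in the directory, where it already lives.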
While having no audit support may be acceptable for QA and Staging environments, it is certainly not for production. Just ask your SecOps team — they will never greenlight it! Particularly if you store, process, and transmit sensitive data like financial transactions or health records. Configure audit to log every action performed at the cluster and application level (like login attempts, config maps modification, pod deletion or creation, deployment scaling or editing) and send these logs in real time to ELK/Splunk or any other centralized logging system your organization may use. Based on these logs, the Operations team will define alerts for intrusion detection and to prevent any malicious activity from outside and inside the cluster.
The number of Kubernetes clusters, worker nodes, and running applications can be large. It may feel like a huge endeavor to log all interactions between components and users or scripts, and that enabling auditing would require quadrupling the logging cluster capacity. It does not. Policies with rules and audit levels can be defined to record only the minimum data necessary for each resource.
There are four logging levels for audit events:
- None: Will not log the events specified in the policy rule
- Metadata: Logs the time, user, object name and type, and action performed
- Request: Logs all metadata plus the body of the performed API request.
- RequestResponse: Logs everything including the response of Kubernetes APIs and the performed action
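To illustrate how the levels above combine into a policy, here is a hedged Python sketch. The `AUDIT_POLICY` structure mirrors the `audit.k8s.io/v1` `Policy` object, in which the first matching rule sets the level for an event. The specific rules are illustrative assumptions rather than a recommended policy, and `audit_level` is a simplified, hypothetical matcher that only looks at verbs and explicitly listed resources.

```python
import json

AUDIT_POLICY = {
    "apiVersion": "audit.k8s.io/v1",
    "kind": "Policy",
    "rules": [
        # Record who touched secrets and config maps, but never their payloads.
        {
            "level": "Metadata",
            "resources": [{"group": "", "resources": ["secrets", "configmaps"]}],
        },
        # Record request bodies for changes to deployments.
        {
            "level": "Request",
            "verbs": ["create", "update", "patch", "delete"],
            "resources": [{"group": "apps", "resources": ["deployments"]}],
        },
        # Skip high-volume read-only traffic on everything else.
        {"level": "None", "verbs": ["get", "list", "watch"]},
        # Catch-all: metadata for any remaining request.
        {"level": "Metadata"},
    ],
}


def audit_level(policy, verb, group, resource):
    """Return the level of the first rule matching the request (simplified)."""
    for rule in policy["rules"]:
        if "verbs" in rule and verb not in rule["verbs"]:
            continue
        if "resources" in rule and not any(
            entry.get("group", "") == group
            and resource in entry.get("resources", [])
            for entry in rule["resources"]
        ):
            continue
        return rule["level"]
    return "None"


if __name__ == "__main__":
    print(json.dumps(AUDIT_POLICY, indent=2))
    print(audit_level(AUDIT_POLICY, "patch", "apps", "deployments"))
```

Ordering matters in a policy like this: the narrow rules for sensitive resources come first so that the broad read-only exclusion cannot hide them.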
Your Ops team will know that all systems are active and healthy if the applications write log messages. Sometimes you have existing tools that should be integrated with the new Kubernetes cluster; other times you will start from scratch and create your own logging infrastructure. In any case, your approach should be consistent across all teams in the organization and provide an easy-to-use solution for everyone.
In addition, the set of tools you choose should have proper role-based access control configured, usually per team, per project, or per environment. In a small startup, a single ELK cluster is typically open to everyone in the company, and everyone is welcome to create new dashboards and visualizations in Kibana, for any fields and indexes, as long as it serves their team’s needs and goals. In a large enterprise, the logging policy will differ: access to Elasticsearch or Splunk indexes, Kibana dashboards, or the equivalent systems must be strictly separated. All Kubernetes components will need to be assigned a specific match rule in fluentd to send logs to different logging cluster endpoints, based on the pods, containers, and services involved. These are usually complex and fine-grained requirements that are not easy to configure and automate.
Monitoring allows at-a-glance understanding of the overall status and health of production systems. When issues are detected using predefined metric thresholds, the responsible teams should be alerted. The alert may be an email for preventive alerts (for example, in 5 days you may be out of space), a message in a Slack channel (there is an increased error rate for some non-critical system in the product), or a PagerDuty/OpsGenie notification, phone call, or SMS for critical issues.
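The routing described above can be sketched in a few lines of Python. The thresholds (five days and one day until the disk is full), the channel names, and the function names are assumptions chosen for illustration; in practice this logic usually lives in your monitoring system's alerting rules rather than in application code.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    PREVENTIVE = "preventive"
    WARNING = "warning"
    CRITICAL = "critical"


# Hypothetical mapping from severity to notification channel.
CHANNELS = {
    Severity.PREVENTIVE: "email",
    Severity.WARNING: "slack",
    Severity.CRITICAL: "pager",
}


def disk_alert(free_gb: float, growth_gb_per_day: float) -> Optional[Severity]:
    """Classify disk usage by the projected number of days until full."""
    if growth_gb_per_day <= 0:
        return None  # usage is flat or shrinking, nothing to report
    days_left = free_gb / growth_gb_per_day
    if days_left <= 1:
        return Severity.CRITICAL
    if days_left <= 5:
        return Severity.PREVENTIVE
    return None


def channel_for(severity: Severity) -> str:
    """Return the notification channel for a severity."""
    return CHANNELS[severity]


if __name__ == "__main__":
    severity = disk_alert(free_gb=50, growth_gb_per_day=10)
    if severity is not None:
        print(severity.value, "->", channel_for(severity))
```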
An additional challenge is how to distinguish an infrastructure problem from an application problem. If an application responds slowly, is it because the cluster is running out of capacity, or does the latest application update have a bug? This information helps you determine who to call in case of emergency. Another requirement is long-term retention for the collected metrics, to produce detailed reports later, which will show the total SLAs met in terms of latency or availability of your services.
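As a small worked example of the reporting that long-term metrics enable, the sketch below computes availability from request counts and the downtime budget implied by an availability target. The function names and the 30-day period are assumptions; the arithmetic itself is standard.

```python
def availability_percent(total_requests: int, failed_requests: int) -> float:
    """Share of successful requests, as a percentage."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * (total_requests - failed_requests) / total_requests


def downtime_budget_minutes(target_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period by an availability target."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - target_percent / 100.0)


if __name__ == "__main__":
    measured = availability_percent(total_requests=1_000_000, failed_requests=500)
    budget = downtime_budget_minutes(target_percent=99.9)
    print(f"measured availability: {measured:.3f}%")
    print(f"30-day downtime budget at 99.9%: {budget:.1f} minutes")
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is a useful number to have in hand when deciding how aggressive the alerting needs to be.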
Isolated Environment Requirements
Installing Kubernetes in isolation from the Internet and all open-source package repositories requires additional components to be available in your environment, including a Docker registry, a Linux packages mirror, a Helm charts repository for the Kubernetes manifests, and a binary repository. After the cluster is created, you may need to install more software, for example, a service mesh like Istio, Grafana dashboards, or CI/CD tools. It’s an additional challenge to make all required Docker images available for the Kubernetes cluster to pull from the internal registry.
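One recurring task in such environments is pointing every manifest at the internal registry. The following is a minimal sketch, assuming the mirror keeps each image's original repository path and that a leading path component containing a dot or colon (or equal to `localhost`) is a registry host. The registry name `registry.internal.example` and both function names are hypothetical.

```python
def mirror_image(image: str, registry: str) -> str:
    """Rewrite an image reference so it is pulled from `registry`."""
    first, sep, rest = image.partition("/")
    # Treat the first component as a registry host only if it looks like one.
    has_registry = bool(sep) and (
        "." in first or ":" in first or first == "localhost"
    )
    path = rest if has_registry else image
    return f"{registry}/{path}"


def rewrite_manifest(obj, registry: str):
    """Return a copy of a manifest with every string `image` field rewritten."""
    if isinstance(obj, dict):
        return {
            key: mirror_image(value, registry)
            if key == "image" and isinstance(value, str)
            else rewrite_manifest(value, registry)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [rewrite_manifest(item, registry) for item in obj]
    return obj


if __name__ == "__main__":
    pod = {
        "kind": "Pod",
        "spec": {
            "containers": [
                {"name": "etcd", "image": "quay.io/coreos/etcd:v3.5.0"},
                {"name": "web", "image": "nginx:1.25"},
            ]
        },
    }
    print(rewrite_manifest(pod, "registry.internal.example"))
```

The same list of rewritten references can double as the download list for the images that must be copied into the internal registry beforehand.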
Requirement to Support Existing Tooling
How new tools will play with existing tools is another factor to consider. If you use configuration management tools, and the CI/CD pipelines already exist and are well tested, the integration with existing pipelines could be simple and straightforward – or they may require a lot of effort to migrate to a new way of doing things.
For example, an existing Jenkins, Spinnaker, or Concourse CI pipeline builds the application artifact, scans all source code for vulnerabilities, dockerizes it, and deploys it to virtual machines in your local OpenStack or VMware cloud. It then tests live application responses to test requests, verifies the interaction between components, and watches the live metrics in Prometheus to make sure the new deployment works as expected. If a threshold is violated (let’s say the new release is unexpectedly twice as slow in production, which did not show up on the Stage and QA clusters because they had no live traffic), the pipeline performs an automatic rollback to the previous release and notifies the relevant teams about the issue. When introducing new Kubernetes clusters to existing infrastructure (especially of that complexity), we have to make sure that our cluster solution can be easily integrated with all other systems using webhooks, APIs, SDKs, or agents.
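The rollback decision in such a pipeline can be reduced to a small, testable function. This is a sketch under stated assumptions: the metric names, the 2x slowdown limit, and the 1% error-rate limit are illustrative, and fetching the values from Prometheus is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseMetrics:
    p95_latency_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def should_roll_back(
    baseline: ReleaseMetrics,
    candidate: ReleaseMetrics,
    max_slowdown: float = 2.0,
    max_error_rate: float = 0.01,
) -> bool:
    """Return True if the candidate release violates a threshold."""
    too_slow = candidate.p95_latency_ms >= baseline.p95_latency_ms * max_slowdown
    too_many_errors = candidate.error_rate > max_error_rate
    return too_slow or too_many_errors


if __name__ == "__main__":
    previous = ReleaseMetrics(p95_latency_ms=120.0, error_rate=0.002)
    current = ReleaseMetrics(p95_latency_ms=250.0, error_rate=0.002)
    if should_roll_back(previous, current):
        print("threshold violated: roll back and notify the owning team")
```

Keeping the decision separate from the deployment mechanics means the same check can run whether the target is a virtual machine or a Kubernetes deployment.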
What Are the Options?
Having discussed the functional requirements, you can now consider which solution will satisfy these needs. Each comes with its own benefits and features, but limitations as well. Some of the limitations are not visible during the implementation, like the hidden costs of cloud-based Kubernetes services, so informed consideration and independent investigation prior to implementation are essential.
Managed Solutions and Their Limitations
Managed SaaS Kubernetes offerings don’t always meet your security requirements and/or regulations. You usually won’t have access to master nodes and their configuration or logs, so you will have limited capability to customize the configuration of the various Kubernetes components for your needs. Managed public cloud solutions usually don’t provide the ability to install their equivalent distribution in a local network and on bare metal servers and are thus not feasible for on-premise deployments. In short, if your goal is to deploy in isolated or highly regulated environments, managed solutions are generally not the right choice. After all, you’ll want to be in full control of the data and all system components.
Home Grown Solutions
While this option will certainly cover all your needs, it requires specific expertise that you might not have readily available in-house. In addition, you need to weigh the pros and cons of investing effort into your own custom Kubernetes solution when the same effort could be spent on innovation for your core business. With four releases per year, it’s hard to keep clusters up to date with upstream Kubernetes. Another obstacle is that Kubernetes developers still occasionally introduce breaking changes, requiring careful migration either to a newly created cluster or by upgrading all Kubernetes components in place on a live cluster, which risks data corruption in the data store and possible downtime.
Homegrown custom implementations also involve integrating single sign-on, backup and recovery procedures, and auditing, configuring and optimizing custom logging daemons, setting up Prometheus federation (to collect all cluster-specific metrics within the cluster itself and send them to a central Prometheus metrics cluster for long-term retention), and so on. All of that requires significant time and effort from the team, both for best practices research (to avoid known issues) and for implementation with testing.
Third Party Vendors
Having a contract with a Kubernetes partner can simplify things. It covers most of your needs with experience and expertise, guiding your infrastructure efforts in the right direction, and ensuring your solutions are optimal and stable. But note that custom development may still be required to meet your specific requirements, so choose wisely. You may require additional professional services by your vendor during the integration with existing systems and processes as well as custom development. Do your homework and choose the right provider for your scale and niche.
In larger organizations, there is a separation of responsibilities between teams. One team may be responsible for compute instances, another for network and traffic ingestion, and a third for storage. Once you install your Kubernetes cluster, these teams will have to communicate and coordinate more, potentially requiring organizational process updates, maybe even a complete reorg.
For example, the deployed service manifests of all dependent applications will need to be updated if a networking or routing change occurs, such as a subnet change or the introduction of additional security requirements. With cluster autoscaling, the compute and storage teams need to be aware of the maximum capacity a cluster might require during scale-out periods, and they should monitor the process closely to ensure all resources are utilized in the most effective way.
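A simple way to give the compute and storage teams that number is to derive it from the autoscaling limits. The sketch below assumes the cluster is described as node groups with a maximum size and per-node resources; the `NodeGroup` type, its fields, and the example figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class NodeGroup:
    name: str
    max_nodes: int   # upper autoscaling limit for the group
    cpu_cores: int   # per node
    memory_gib: int  # per node
    disk_gib: int    # per node


def peak_capacity(groups: Iterable[NodeGroup]) -> Dict[str, int]:
    """Total resources needed if every group scales out to its maximum."""
    groups = list(groups)
    return {
        "nodes": sum(g.max_nodes for g in groups),
        "cpu_cores": sum(g.max_nodes * g.cpu_cores for g in groups),
        "memory_gib": sum(g.max_nodes * g.memory_gib for g in groups),
        "disk_gib": sum(g.max_nodes * g.disk_gib for g in groups),
    }


if __name__ == "__main__":
    cluster = [
        NodeGroup("general", max_nodes=10, cpu_cores=16, memory_gib=64, disk_gib=200),
        NodeGroup("memory", max_nodes=4, cpu_cores=32, memory_gib=256, disk_gib=500),
    ]
    print(peak_capacity(cluster))
```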
When your Kubernetes cluster is deployed on-premise, it is much more difficult to install and maintain all its components and dependencies. Consider these obstacles:
- If you have pure bare metal with no automation, you will not be able to auto scale properly.
- If you use vSphere, how will you handle self-service? Can users provision new virtual hosts themselves? Will they select the correct networking and storage? If a separate team is responsible, it may increase the amount of time needed to modify the cluster topology. Jira tickets will be covered in dust as other teams are busy handling requests, slowing iteration.
- How will you implement High Availability for the Kubernetes clusters and all their workloads? On top of all application and configuration considerations (like rack awareness for virtual machines and then pods themselves) there is a need for redundant power supply, effective cooling and so on.
- What is your strategy for disaster recovery? How will you meet high SLA set by management?
- What is your strategy for storage management? Stateful cloud-native applications can benefit significantly from a mature cloud-native storage management platform for application data.
- From an operations perspective, you must patch your OS as the latest security fixes become available in the upstream Linux distribution repositories, and make sure that newly patched or upgraded Docker and Linux kernel versions are compatible with your Kubernetes version. Also, consider the overhead of manual work on Kubernetes version upgrades. In the cloud, this is easier if you treat the nodes in your cluster like cattle, not pets, and are ready to replace them at a moment’s notice. The cloud provider takes care of all datacenter-related work and the Kubernetes maintenance, but this comes with a higher cost for compute capacity and traffic.
- Finally, as an extension of the public cloud environment, you may have a completely offline data center which may not be connected to the Internet. In order to cover this scenario, your tool of choice should provide the feature to be installed completely offline with no external dependencies.
Best Practices for Security
Use role-based access control, pod security policies, and network policies in Kubernetes; SELinux, ipsets, and iptables on the hosts; and admission controllers and admission webhooks. There are many things to secure, so SecOps will define the requirements and suggest solutions, but know that configuring all of these Linux and Kubernetes security features manually is a huge effort for teams.
You may want to explore how to improve tooling to enable faster deployments, disaster recovery, and failover capabilities. The recommended approach is to declare infrastructure as code and build “immutable infrastructure” that has its history of modification stored in Git. Avoid doing in-place updates. A better alternative is to prepare a new image of the OS with all necessary upgrades and hotfixes and replace or reinstall a virtual machine with the updated image, minimizing the risk of discrepancies between different hosts in the cluster. This is not an easy approach, but it contributes to overall stability by making sure the hosts are uniform. Tracking all infrastructure changes in Git helps to pinpoint issues as soon as a breaking change is introduced. You should have a CI/CD process not only for application components but also for infrastructure changes, and the creation of new virtual machines and their configurations should be triggered automatically by any change submitted to Git. All of this requires additional support from your platform and enables automation and repeatability of the process.
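To make one of these controls concrete, the sketch below generates a default-deny ingress NetworkPolicy for each namespace, which is a common baseline on top of which specific allow rules are added. The manifest fields follow the `networking.k8s.io/v1` API; the namespace names and the helper function are hypothetical, and the policy only takes effect if the cluster's network plugin enforces network policies.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Build a NetworkPolicy that blocks all ingress to pods in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            "podSelector": {},  # an empty selector matches every pod
            "policyTypes": ["Ingress"],
        },
    }


if __name__ == "__main__":
    for ns in ("payments", "reporting"):  # hypothetical namespaces
        print(json.dumps(default_deny_ingress(ns), indent=2))
```

Generating such manifests from code, and committing the output, is one way to reduce the manual effort that these security features otherwise require.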
While cloud-native technology is challenging to implement in highly restrictive and isolated environments, it isn’t impossible. Considering the requirements and solution options carefully can help you start on the right path from the beginning and avoid common pitfalls. For additional information, our team of SMEs is available to consult and answer questions. Reach out to us at firstname.lastname@example.org.