Most enterprises already have a reliable logging and monitoring system in place, so why should you worry about it in the context of Kubernetes? Well, traditional logging and monitoring tools are designed for stable infrastructure and application deployments. Cloud native environments, on the other hand, are highly dynamic. The IT world has changed and so must your toolkit.

A key challenge is that traditional systems rely on anchors like IP addresses or machine names, but in a cloud native setting, these anchors are continuously changing. Containers are spun up, torn down, and redeployed on different VMs. The same applies to the VMs themselves, which are redeployed on different node pools and even in different network segments. To keep track of all this, you need a new breed of logging and monitoring system, one that is cloud native.

You do have a few options which we’ll categorize into four groups:

  1. Managed log collection and monitoring services: There are a number of available tools on the market. Some focus purely on log collection, others on metrics collection, while others do both. Datadog is a good example of the latter and probably the most popular managed logging and monitoring tool for Kubernetes clusters.
  2. Custom-built or pre-existing log collection and monitoring frameworks: Some companies may already have a cloud native-compatible system in place. If that's the case, you've got a winner.
  3. Cloud-hosted logging and monitoring: All major clouds offer their own solution. Google has Stackdriver, Azure has Azure Monitor, and AWS has CloudWatch.
  4. Self-managed logging and monitoring: Tools that are either built into, compatible with, or integrated with the tools you use to manage your containerized apps. Ideally based on open-source projects such as Prometheus and Elasticsearch.

While managed solutions are easier and faster to set up, reducing time to market, you are also giving a third party access to your logs. If you're dealing with sensitive data, this is not an option. If data sensitivity isn't an issue and you select a managed solution, you'll need to integrate it with your app and infrastructure lifecycle practices. That means as soon as a cluster is created, logging and monitoring must be enabled automatically. Otherwise, it'll be a manual process and, as we all know, manual processes are error-prone.
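
To make that concrete, here is a minimal sketch of such automation: a hypothetical post-provision hook that installs a monitoring agent Helm chart as soon as a new cluster's kubeconfig becomes available. The chart name, release name, namespace, and value key below are placeholders, not any specific vendor's chart.

```python
import subprocess

def install_monitoring(kubeconfig_path: str, cluster_name: str) -> None:
    """Post-provision hook: install the monitoring agent right after cluster creation.

    'monitoring/agent' and the --set key below are placeholders for whatever
    agent chart and values your chosen tool actually uses.
    """
    subprocess.run(
        [
            "helm", "upgrade", "--install", "monitoring-agent", "monitoring/agent",
            "--kubeconfig", kubeconfig_path,
            "--namespace", "monitoring", "--create-namespace",
            "--set", f"tags.cluster={cluster_name}",
        ],
        check=True,  # fail loudly so the provisioning pipeline surfaces the error
    )

# Called from the cluster-provisioning pipeline once the cluster is ready, e.g.:
# install_monitoring("/tmp/new-cluster.kubeconfig", "payments-prod")
```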

If you have a custom-built or pre-existing log collection and monitoring framework and are introducing a container orchestration platform into your technology stack, you need to plan for the integration of these technologies. Make sure they really are compatible.

Cloud-hosted logging and monitoring is really convenient and cost-effective. If all your apps are running in a single cloud and you have no intention of changing that in the near future, this is likely your best option. But beware: if you do decide to move to a different cloud, you won't be able to migrate your monitoring system, and rebuilding it will take significant work and effort.

We generally recommend the fourth category as it provides the flexibility to switch vendors or environments without having to reinvest into a new logging and monitoring solution. Some vendors, Kublr included, leverage popular open source projects such as Prometheus, Grafana, and Elasticsearch, which you can continue using even if you switch platforms. If provided by a vendor, it will already be pre-configured (e.g. the ability to scrape metrics, query, summarize, define custom dashboards and reports, etc.), speeding up your adoption of the cloud native stack.

The real value of open source tools such as Prometheus, Grafana, and Elasticsearch comes at a higher level, though. When building your own application alerts, dashboards, and charts, you can easily migrate them from one environment to another without losing the work you've already done. There is a real opportunity here to avoid lock-in, and we believe you should take it.
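
As an illustration of that portability, dashboards built in Grafana can be exported from one environment and imported into another through Grafana's HTTP API. A rough sketch in Python; the instance URLs, API tokens, and dashboard UID are placeholders:

```python
import requests

SRC = "https://grafana.dev.example.com"    # placeholder source Grafana
DST = "https://grafana.prod.example.com"   # placeholder target Grafana
SRC_TOKEN = "dev-api-token"                # placeholder API tokens
DST_TOKEN = "prod-api-token"

def migrate_dashboard(uid: str) -> None:
    """Copy one dashboard, identified by its UID, from SRC to DST."""
    # Export: GET /api/dashboards/uid/<uid> returns the dashboard JSON plus metadata.
    src = requests.get(
        f"{SRC}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {SRC_TOKEN}"},
    )
    src.raise_for_status()
    dashboard = src.json()["dashboard"]
    dashboard["id"] = None  # let the target instance assign its own internal id

    # Import: POST /api/dashboards/db creates or updates the dashboard.
    dst = requests.post(
        f"{DST}/api/dashboards/db",
        headers={"Authorization": f"Bearer {DST_TOKEN}"},
        json={"dashboard": dashboard, "overwrite": True},
    )
    dst.raise_for_status()

# migrate_dashboard("app-latency")  # hypothetical dashboard UID
```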

You’ll Need Monitoring on Two Levels

Whether cloud native or not, you still need to monitor all the layers of your technology stack, starting from the infrastructure (hardware, virtual machines, network, disks) to the OS to the application level.

Cloud native technologies add a few extra layers, including the container, container orchestration, and frequently a container network overlay. They also introduce additional meta-information that helps you navigate cloud native environments smoothly. To identify your application, you can't rely on a server IP anymore; instead, you can tag and label your processes and identify elements of your applications that way.
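
As a small illustration of label-based identification, an application can attach labels to the metrics it exposes so that dashboards and alerts select it by label rather than by host or IP. A minimal sketch using the Python prometheus_client library; the label names and values are only examples:

```python
import random
import time

from prometheus_client import Counter, start_http_server

# Identify the workload by labels, not by the node or pod IP it happens to run on.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["app", "component", "environment"],  # example label dimensions
)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    while True:
        REQUESTS.labels(app="checkout", component="api", environment="prod").inc()
        time.sleep(random.uniform(0.1, 1.0))
```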

Infrastructure layer and Kubernetes components: Kubernetes components, as well as the infrastructure and OS, produce numerous logs, events, and metrics that provide a solid understanding of overall cluster health, information you must leverage. To guarantee production-grade clusters, your Kubernetes vendor must provide this by default. It is their responsibility to ensure Kubernetes and the underlying infrastructure are always up and running, so their solution should include a set of meaningful dashboards and alerts that give a clear picture of infrastructure health and notify IT of deviations or looming disasters. If it doesn't, it simply isn't a production-ready solution.

Application layer: Application logs and metrics, as well as how they are collected, differ from those of system components, but one difference is particularly notable. While infrastructure and Kubernetes components are more or less pre-defined, enabling log and metrics collection systems to start collecting data from the get-go, applications have a lot more variables.

Your log and metrics collection system can, and should, collect all published logs and metrics including those that are application-specific if they are declared in the application metadata or can be auto-discovered. However, only a standard subset of these metrics (e.g. process CPU, RAM usage, etc.) can usually be visualized by default. Application-specific visualization – a.k.a. custom dashboards – is generally done by system users and operators.
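
To make the auto-discovery idea concrete, a common community convention is to declare a scrapeable endpoint in pod annotations (for example the widely used prometheus.io/scrape annotation). A rough sketch of the discovery side using the official Kubernetes Python client; the annotation keys are a convention, not a Kubernetes standard:

```python
from kubernetes import client, config

def discover_scrape_targets():
    """List pods that declare a metrics endpoint via annotations."""
    config.load_kube_config()  # or config.load_incluster_config() inside a cluster
    pods = client.CoreV1Api().list_pod_for_all_namespaces()
    targets = []
    for pod in pods.items:
        annotations = pod.metadata.annotations or {}
        if annotations.get("prometheus.io/scrape") == "true":
            port = annotations.get("prometheus.io/port", "9090")
            path = annotations.get("prometheus.io/path", "/metrics")
            targets.append(f"{pod.status.pod_ip}:{port}{path}")
    return targets

# print(discover_scrape_targets())
```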

Assuming you’ll run several clusters with multiple apps in each cluster, there are two ways you can set up logging and monitoring:

  1. Run an independent stack in each cluster
  2. Have one central aggregator for all logs and metrics

We recommend the second option; let's explore why.

Why You Should Centralize Monitoring and Logging

Using one uniform logging and monitoring tool across different groups, apps, and clusters is a best practice. It’s also a best practice to centralize the management and governance of log collection and monitoring. It’s just more scalable from an organizational standpoint.

Unless there are strict security requirements to keep the data inside the cluster, we recommend a centralized approach. It's easier for Ops to grasp the health of the entire infrastructure through one central location versus jumping between Grafana instances for each cluster. Additionally, some trends are only uncovered when data is aggregated. Without a single big picture, problems may go undetected until it's too late.
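
For example, with all clusters feeding one central Prometheus-compatible backend, a single query can compare the same signal across every cluster. A minimal sketch against the standard Prometheus HTTP API; the endpoint URL and the cluster label are assumptions about how your setup tags incoming data:

```python
import requests

PROMETHEUS = "https://prometheus.central.example.com"  # placeholder central endpoint

# One query over all clusters: 5-minute CPU usage rate, summed per cluster,
# assuming each time series carries a 'cluster' label added at ingestion time.
query = 'sum by (cluster) (rate(container_cpu_usage_seconds_total[5m]))'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("cluster", "unknown"), series["value"][1])
```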

Centralizing it also brings some challenges. For instance, how do you restrict access to the data? Ideally, you should use the same RBAC tool that you use for the clusters. After all, if someone shouldn’t have access to a particular cluster, they most likely shouldn’t have access to its logs either.

Properly sizing, scaling, and operating a centralized monitoring and log collection system is another challenge. It can be addressed by relying on an open-source-based integrated solution supported by your Kubernetes platform provider.

Implementing this with the Elastic Stack is relatively easy because Elastic implements RBAC based on indexes and supports various scaling scenarios. That's key, since every cluster or app can send its data to a different index, and the Elastic Stack lets you set up access control on top of these indexes.
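
As a rough sketch of what that might look like, the snippet below creates a read-only role scoped to one cluster's index pattern via Elasticsearch's security API. The endpoint, credentials, role name, and index naming scheme are placeholders, and the _security APIs require Elastic's security features to be enabled:

```python
import requests

ES = "https://elasticsearch.central.example.com:9200"  # placeholder endpoint
AUTH = ("elastic", "changeme")                          # placeholder admin credentials

# Grant read-only access to the log indexes of a single cluster, assuming each
# cluster writes to its own index pattern such as 'logs-payments-prod-*'.
role = {
    "indices": [
        {
            "names": ["logs-payments-prod-*"],
            "privileges": ["read", "view_index_metadata"],
        }
    ]
}

resp = requests.put(f"{ES}/_security/role/payments-prod-readers", json=role, auth=AUTH)
resp.raise_for_status()
```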

A Kubernetes platform that integrates with an identity management system or single sign-on should also set up RBAC for logging and monitoring. Generally, monitoring data isn't that sensitive, especially not at an aggregated level. But in some cases, you may want to filter which information is accessible through query results from your Prometheus stack. There are tools on the market with that ability, including Datadog in the managed service category.

Then There Is Scalability

Setting up your logging and monitoring stack in a dev or QA cluster is one thing; going into production with multiple clusters, each with dozens of nodes, is another. First of all, your logging and monitoring system must be able to handle a large amount of incoming and outgoing data from all pods. If your apps are broken down into small components, maybe even microservices, we're talking about a lot of data.

Then there is the connectivity issue. What if your logging and monitoring tool disconnects from the environments your apps are running in? Whichever tool you use, it must be able to handle these types of complications. It must work in unreliable environments, and it must also scale automatically when your workloads and clusters do.

In cloud environments or with managed tools, you don't need to worry about this; your provider will handle it by default. However, in semi-disconnected environments, it's up to you. So make sure your Kubernetes platform's logging and monitoring component can handle this; without it, you aren't ready for production.

Smart Alerts for Ops

Once you’re managing multiple clusters, you’ll need a smart alerting system. With smart, we mean that it won’t flag unimportant events but rather escalate and aggregate failure messages to single out the root cause. Also, you’ll want to avoid setting up separate alerts for different environments as your system must understand the context. Centralization is important for alerts on system functionality and your ability to enforce governance rules across your environment. For instance, production alerts should go into more critical channels, while dev alerts will go to specific groups.

Some frameworks may have open connectivity requirements, such as requiring every instance, worker node, or app to send metrics and alerts to the outside world. Other solutions may be more conservative, collecting metrics and logs through just a few endpoints and open ports. The latter is preferable in an enterprise environment.

Conclusion

A cloud native software and system architecture is a lot more dynamic than traditional systems, possibly making your current logging and monitoring toolkit obsolete. Numerous tools are hitting the market to fill that gap, whether managed services, cloud-hosted solutions, or open-source tools. Our recommendation is to go with open-source options such as Prometheus, Grafana, and Elasticsearch. They are incredibly flexible and can easily migrate between environments if you need to switch vendors or platforms.

As you consider your options, keep in mind that there are two layers that create valuable metrics which you’ll need to capture: the app layer and the infrastructure layer. Beware, some monitoring solutions may be geared more to one or the other layer. While most can monitor both, it’s better to rely on a universal and open tool from the beginning.

While it’s possible to monitor clusters individually, we don’t recommend it, particularly as you scale. You’ll need a single integrated view of what’s going on in your infrastructure. That’s only possible if you centralize monitoring. For many organizations, that may even be an InfoSec requirement. If that’s the route you go, you’ll need to integrate logging and monitoring with your RBAC system ensuring only people who should have access to the metrics really are.

The cloud, containers, and Kubernetes have brought scalability to a whole new level. If elasticity is important to your apps, your logging and monitoring system must be able to scale right with it.

Schedule a Kublr demo to learn how our team handles logging and monitoring.