Want a lab that’s fun to run and beneficial for learning about troubleshooting using logging and monitoring? We’ve got your tutorial!
One of the advantages of centralized logging and monitoring, which is available with Kublr’s Kubernetes distribution out-of-the-box, is the ability to easily correlate events that occurred in the cluster and lead to the issue that need troubleshooting. We’ll explore the features of Prometheus and Grafana monitoring and Kibana logging visualization to easily identify cluster and application problems, and the relationship of these events to each other.
It is important to know in advance which types of events we are interested in visualizing in Kibana in order to create useful dashboards and to avoid throwing everything together into an “all-inclusive” messy dashboard – which is difficult to read and understand. The author of the visualizations in Kibana and Grafana dashboards has to be familiar with all components of the cluster and the applications running in order to plan in advance what metrics and fields should be displayed on a Dashboard, and to enable users to quickly identify the information they are looking for, instead of getting lost in piles of unnecessary data.
For our tutorial, we are going to use an existing Kublr cluster which provides centralized Prometheus and Grafana for metrics aggregation and visualization, and also provides centralized Elasticsearch and Kibana application stacks. We recommend you to try our demo for this tutorial if you do not have any existing development or experimental clusters where you can create/break things.
With Kublr you can easily automate the deployment of a fully configured Kubernetes cluster with the necessary self-hosted logging and monitoring components, pre-configured, so you can skip the configuration burden necessary to obtain a fully functional Kubernetes cluster with advanced capabilities of per-pod monitoring and centralized logging.
Feel free to use any existing cluster with similar configuration if you have one, in order to follow the tutorial. First, we will explore the environment and make sure that everything is configured properly and we can both access all metrics exposed to Prometheus, by pods in Kubernetes cluster. Also, verify that Kibana can display all logs from all containers in the cluster. Navigate to the Grafana dashboard, in our case by clicking this link in Kublr dashboard:
Log in and open the pods dashboard, or any standard system dashboard that you have in Grafana which displays metrics like CPU, Memory, Disk I/O, or Network information from your pods (to confirm that the cluster is suitable for the tutorial and meets the prerequisites).
In our case it will be the Memory dashboard (at a later stage, we will demonstrate an imaginary “disaster” which happened in our cluster, and how we find the reason using our new metrics and logs correlation dashboards):
You can see that we have Memory information collected per pod. So this prerequisite is met in terms of metric collection from ALL pods in the cluster.
Exploring Kibana for Kubernetes pods log collection
Login to your Kibana. Follow these instructions if you’re on a fresh “Kublr-in-a-Box” setup. Create an index pattern from “Management” – “Index patterns” if this is your first time accessing the Kibana in the cluster:
If you’re using an empty Kublr cluster just for this demo, create the “logstash-*” pattern to search all similarly named indexes and logs, and select the time filter field name @timestamp, shown here in these screenshots:
Please note that this is a demo example. You will not use such a pattern for your production cluster if you are about to consume tens of thousands of log events per second. The users of your Kibana service should search indexes separately, divided either by department name for streaming the logs into a particular index or by cluster type\project name, with any kind of logical separation that will allow you to avoid overloading the cluster with queries that are “too wide” and not specific. If not tuned and segmented properly, vague queries will cause ElasticSearch to work unnecessarily hard each time (in terms of CPU and Disk IO utilization during each query), so for the queries to be more specific, your job as the administrator of logging system is to divide them at least on an index level.
During this demo, we are going to search “all logstash-*” indexes, because we do not have yet any logical separation, and no large scale, while we work with a single default Kubernetes cluster that is the only one streaming its logs to this ElasticSearch. If you use an existing cluster of your own setup, proceed to verify if the logs are collected from all types of pods, both in “kube-system” namespace and in the “default” application namespace.
Navigate to “Discover” in the dashboard, and select “kubernetes.labels.app” from the fields selector to see some of the pods where we have collected the logs:
Building custom logging dashboards for applications running on Kubernetes
At this stage, we are ready to set up the scenario for this tutorial. Imagine your cluster has the following components:
- Standalone MySQL database
- A “db-writer” HTTP service which exposes API to receive data and writes it to MySQL (think of any service that needs to save information in the database)
- Several “applications” that interact with the “db-writer” using it to save their information into MySQL tables too. They both send data to “db-writer” and sometimes read from it
This setup is not as complex as it seems, and we have both custom “db-writer” and the demo “applications” ready for you to use in this tutorial directly from Dockerhub.
To proceed, deploy the basic MySQL “deployment + service” combo using this manifest:
- name: mysql
- name: MYSQL_ROOT_PASSWORD
- name: MYSQL_DATABASE
- port: 3306
This allows us to use the tiny database for demo purposes, but is, of course, light years away from how real production setup of MySQL would look like. It’s good enough for this exercise – after all, we’re about to break it and use the logs and metrics to quickly troubleshoot the reason.
Do not change the “120m” CPU limit. It is set for this exercise. Our imaginary MySQL must be “slow” and we make it slow for the demo by heavily limiting the allowed CPU resource. It will cause an interesting effect in the demo Node.js application that we’re about to use, and we’re going to investigate the issue using the available logs and metrics.
You can have a quick look in Kibana to see if the MySQL pod loaded properly with no errors. Navigate to “Discover” and type “kubernetes.container_name: mysql” into the search panel. You can also select to show only the “log” field from each message (by clicking “add” near the field name in the sidebar), this way having clear and readable logs on the right side, it’s useful when we are watching logs of only one service and are not interested to see all other fields like “namespace” name, “container ID”, “cluster name” and so on. You should see something like this:
If you see logs from the MySQL pod, then everything is ok and we can deploy the “db-writer” service using this file:
- name: db-writer
- port: 80
When this service is started, it only notifies us with three short messages about the port number and creation of the MySQL table. This is an indication that it was able to connect to MySQL service using the built-in “MYSQL_SERVICE_HOST” environment variable (supplied by Kubernetes). If you change the query in Kibana to “kubernetes.container_name: db-writer” you can see its logs:
Please note: if you change the MySQL deployment definition for a different “Service name” (like “test-mysql-for-tutorial”) our “db-writer” will not be able to find the MySQL service and will fail, in this case, manually set its “MYSQL_SERVICE_HOST” variable in your template to the “cluster.local” internal name of your own MySQL service, or even an external one if you decide to use external MySQL for the tutorial.
Run the “execute” command with just “env” to see all environment variables in the “db-writer” container:
kubectl exec <your pod name> -- env
As you can see there is a lot of information about all other “neighbors” in the cluster, like other pods and services, with the ports and protocols they listen to. It is useful for the development of independent native Kubernetes services that utilize this feature and read all necessary information from the environment instead of a config file or command line flags.
Deploy applications to Kubernetes and analyze the logs
There is one more service to deploy, which will simulate the “readers and writers.” Those are our custom services (representing imaginary website backend components) that use the “db-writer” service to read/write to the MySQL database for them. Each “website-component” will send a “service_name” and “data” JSON fields as HTTP POST requests, that will be written into MySQL by “db-writer” when received. We can control the rate of messages they send to “db-writer”, and the length of data they send. All of these parameters are pre-defined during deployment of the “website-component” pod, so we will deploy several clones of the same pod but specify different rate for their operation requests.
Then we will use Jmeter or Apache Benchmark to stage a DDoS attack on our website, which simulates an excessive load on both our services and the database by submitting too many “write data” requests through one of our exposed “website components” (we can pick any of the application pods, and hit it with a large amount of HTTP POST requests simulating abuse of any imaginary “article submission” / “post a comment” function of the website, that was not protected properly by the developers). This will cause trouble for all of our pods and the database, and we will see how easily we would identify such a real-world scenario and anomaly, using our dashboards in Grafana and Kibana.
There is also chance that our website components are not perfect, and may have a memory leak (or a CPU hog loop) hiding in the business logic of the application, or in a recent feature that the developers released on a Friday night in a hurry to meet the deadlines. This is a “real-world” scenario simulation that includes two small Node.js services that were written for this tutorial.
Here is a simple schema of our exercise setup, all components, and their interactions:
The “website component” includes a random “activity simulation” generator that will send data from time to time on a random interval, to “db-writer”, to simulate user activity on our imaginary website. So we’re about to see activity in Kibana logs, and in Grafana metrics of the pods.
Here is the deployment of website-component pod, feel free to create as many replicas as you think your cluster will handle, but be aware that each of them generates up to twenty random “read requests” per second by default, and several random “write” requests per second (simulating comments posting, or articles submission activities, on the imaginary website).
Save the following into a “website-component-deployment.yml” file, modify it if necessary for your cluster (if you are not using Kublr cluster for the tutorial) and deploy.
- name: website-component
- name: POST_REQUEST_FREQUENCY
- name: GET_REQUEST_FREQUENCY
- port: 80
- port: 80
After deploying it with “kubectl create -f website-component-deployment.yml”, and waiting a few minutes to allow it to finish creating the load balancer, have a look at the service definition to get the endpoint for access:
kubectl get service website-component -o yaml
Use the found endpoint to open the API default welcome message that is saying:
Hello! I am Kublr tutorial Website-Component API! Please use POST requests to “/submit-comment” to make me POST your data into db-writer. Or send a GET “/load-news” request to read some data.
If you see this message that means our “website-component” pods are up and running, and we are ready to inspect the activity they are generating, in Kibana logs. Navigate to Kibana and try this search query to show all recent logs from any of the “website-component” pods: “kubernetes.container_name: website-component”
Then check out the activity these pods are generating on the “db-writer” and MySQL. Notice the “db-writer” logs and the high rate per second of these log messages (starting from the moment when we deployed “website-components”):
The three replicas of “website-component” are generating approximately 2500 log messages per second (most of these operations are reads, because we set the frequency of “dummy” actions to 20 reads per second, and 5 writes per second, using “GET_REQUEST_FREQUENCY” and “POST_REQUEST_FREQUENCY” variables in its deployment manifest).
Now have a quick look at the default Grafana dashboard of pod Memory metrics, navigate to “Kubernetes Memory usage” dashboard, and select the needed pods by starting to type “website…” into the “pod” names box like shown below:
When you select them all, you can also start typing “db-writer” and select the writer pod too, so we can see all our important pod Memory consumption. Scroll down to the bottom panel of pod memory graphs, and unfortunately, we see that “db-writer” memory consumption climbs for no visible reason, and does not seem to stop at any level. It’s also not a pre-warm of the service, because it has no caching features in it, so there’s something wrong with this version of the “db-writer”, but what exactly, how do we find out the reason and notify the developers about this bug? Moreover, how do we prove to the developers that the issue is not on the DevOps\Infrastructure side, but in the application itself?
Note: When reproducing the tutorial on your own cluster, the numbers may vary, but the overall situation will be the same. You might see a faster or slower leak graph, and the number of log messages might vary as well, but the ratios will be similar and the solution will also be visible very well on the graph, and in the logs. You can also experiment further with the given pods and their settings, by adding or lowering the CPU resource limits, changing Get/Post frequency numbers in the “website-component” pods, scaling the pods to tens or hundreds of replicas, or using your own unlimited external MySQL to see “up to how many” write messages from these pods it can handle, etc’. This is a great playground for learning visualization in Kibana or Grafana, and we will create an interesting dashboard soon.
Troubleshooting a memory leak in an application by correlating log messages from all related services
A quick glance on the type of messages we have from “db-writer” suggests that the memory leak could be caused by some data accumulating inside this writer service process, which is not being released or written “somewhere” (to MySQL?) fast enough. Let’s check this theory by comparing the amount of messages which include “Received data from website-component-xx to write into MySQL” to the number of received messages with the text “Information from website-component-xx was written to MySQL” text.
What if we have more data pouring into the writer service, than what MySQL pod is able to write successfully on time (and this causes a queue of messages to build up inside our db-writer service, all waiting to be written to MySQL and accumulating in the memory of db-writer container, while the MySQL is slow to respond, and does not process message writing quickly enough)?
kubernetes.container_name: db-writer AND “was written to MySQL”
returns 6620 hits (appearances of the message in the last 15 minutes):
But “Received data from” appeared 13411 times! Search for this:
kubernetes.container_name: db-writer AND “Received data from”
The important keyword here is “Received data from”, indicating all messages that notify us about data received from any of the “website-component” pods.
Now we know that a possible reason for a memory leak is too many write data events sent to “db-writer.” We can lower the replica count of “website-components” from 3 to 1 to see if it helps, just to verify that our theory is correct. Go to the Kubernetes dashboard, and navigate to “Deployments”, then open the menu button on the right of the “website-component” deployment and select “Scale.”
A window will open asking for replica count number, set it to “1” to shrink our deployment size:
After the change, we need to wait some time to see the effect. You should see something like this after 5 minutes, for the same query:
Now that the message rate actually changed, we know that just one pod exists, and the load on “db-writer” is probably lower than it was, so lets calculate the ratio between “received from…” and “was written to…” messages. In order for the calculation to be meaningful, you will need to select the time frame starting from “new” message rate, which is clicking and dragging on the time series panel to select. In the following screenshots, the selection is for 10 minutes of time, starting from when “single replica” deployment became active:
We have roughly 3000 hits of “Received data from…”
And look at that! We have 3712 hits of “was written to…” messages for the same time period! (More than what was received during the same time).
That means that the “db-writer” service was working hard to “flush” the queue it had in memory, into MySQL, and did, even more, write operations to the database than the amount of new “write requests” it received during this time from “website-components.”
After some time it stabilizes and will be equal so that all requests are written to MySQL and nothing remains inside the “db-writer” memory. But that will happen only if your MySQL handles it well. If it doesn’t, you might see a similar but a little different effect, for example if after scaling down the number of “website-component” replicas to one, and waiting 15-20 minutes, you still see a discrepancy between the “Received messages” and “Written to MySQL”, try lowering the number of random “POST” requests your website-component sends to db-writer, by changing the “POST_REQUEST_FREQUENCY” variable and redeploying the “website-component” YAML manifest. Or change the CPU limit of mysql from 50m to 80m (or 100m), it will still keep it relatively “slow” for the rest of the tutorial (which is what we need for the next exercise), but will allow it to consume all current messages that the “db-writer” tries to write.
Let’s see the effect on actual memory utilization after we scaled down the number of replicas (we show “last hour” in Grafana to see the effect over time, both during the leak and after scale down of replicas, which lowers message rate to MySQL and allows db-writer to stop leaking memory by not having to create too many new threads that will be waiting in queue to write their message to MySQL and consume memory in db-writer pod):
Note that the timestamp on the screenshot is “1:40” which is approximately the same as we saw in Kibana when the deployment was reduced from 3 to 1 replica. So it was definitely the issue and we resolved it! But it was a very basic, “manual” investigation. It is possible only when you have a simple deployment of just a fewapplication components. When working at scale, with tens or hundreds of services, and other complex infrastructure services (like databases, caching clusters, messaging queues, load balancers, and so on), the only way to be able to handle the massive amount of data that is needed to detect failures and issues, is to visualize them on a dashboard. You can represent for example the number of “log level: Error” and “log level: Info” messages as a graph, with different color lines for each log level. So when the “Error” line spikes abnormally, it is visible on the graph, and the on-call engineer (SRE or DevOps team member) can zoom in and investigate, by reading the logs of that particular time frame where the spike of errors was detected.
Another useful use case we want to mention is the correlation between “response time” or “latency” metrics, to cluster resource utilization like CPU/Disk/Network. So a use case of visualizing “log level” messages is relevant to Kibana and ElasticSearch because it is based on log messages rate. A use case of “latency to cluster resources correlation” is based on metrics, which are collected to Prometheus in our case as a time series data points, and can be visualized in Grafana. Please refer to one of our previous articles if you are interested to learn more about advanced Grafana visualization techniques. In this tutorial we will continue to concentrate on Kibana logging and visualization while just briefly mentioning Grafana and using the default set of metrics and dashboards supplied by Kublr demo platform, to confirm our findings when necessary.
We now want to change our setup a little bit, to simulate a website with moderate activity. Change the following variables in “website-component” manifest and redeploy it:
- name: POST_REQUEST_FREQUENCY
- name: GET_REQUEST_FREQUENCY
This will make our “baseline” read/write activity low, so we can work on the Kibana dashboard without worries that our cluster will collapse for some reason. And when the dashboard is ready, we will try to imitate a DDoS attack on the endpoints of “website-component” service (those are “/submit-comment” and “/load-news”) using the “Apache Benchmark” tool and will examine how quickly the dashboard reflects the anomaly.
Verify that after some time the situation is stable and db-writer is able to write all messages and we have no issues. In our example, we saw 30 minutes. You’ll want to make sure that the last 5 -15 minutes show an equal number of “Received from” and “Written to” messages):
If we want to visualize the rate of messages, instead of switching back and forth between queries comparing their results (the number of hits), we can save the queries and put them on a graph. Let’s save the main query “kubernetes.container_name: db-writer”, by clicking the “Save” button and giving it a descriptive name:
Then switch to “Visualize” and add a new Line Graph:
In the next step, you can select either a saved query, as a base for the graph data points, or start with a new search. Select the query we just saved:
At first, you will see just 1 point, which is an aggregated “count” result based on the time selection which was “In the last 30 minutes” in our case:
We do not want to see the aggregated number and instead, we need to spread the results on the graph for each point in time. The Y-axis on the graph is the “hit” count, and X-axis is the time. Click the “X-Axis” and select “Aggregation: Date Histogram” then a field on which we want to break the results is “Field: @timestamp” Leave the interval at “Auto” if you want Kibana to decide how many data points to show based on the zoom level (for example if you select “Minute”, it will look good with the “last hour” and “last 30 minutes” searches, but will be too heavy if you select “last week” for example, because Kibana will try to display each and every minute in the last week, so the dashboard will be too heavy for the browser). Optionally name the “Custom Label: Timestamp” or anything you would like to, this is an override of the default label for “X-Axis”. Here are the steps to follow:
When you click the “Play” button (Number 4 on the screenshot), the graph is updated with whatever settings were changed in the left sidebar. In our case, we see the needed results from “db-writer-logs” query.
Now we would like to separate the results of “Received from” and “Written to” queries, so we can see them as two separate lines on the graph. To do so, we need to add sub aggregations, based on some filters, if we want to drill down the “base query” (which is the broad one, just “all logs from db-writer” with no “AND” filter) and display each sub aggregation as a separate line, click the “Split series” and select “Sub Aggregation: Filters” This will allow us to add a free text “Filters” which are essentially the “AND something” keywords we were adding in:
[kubernetes.container_name: db-writer AND “Received data from”] query.
To add more filters press on “Add Filter” button, create three of them to visualize the count of each by separate color line on the graph:
- “Received data from”
- “was written to MySQL”
- “Successfully read from MySQL”
The 3rd one is new, we did not use it yet, but it also appears a lot in the logs so we would like to know how many times db-writer actually reads from MySQL, because what if something goes wrong with the reads, and we were monitoring only the “write” interactions with MySQL?
Save this visualization using “Save” and name it “db-writer-messages”:
Navigate to “Dashboard” and click “Add new dashboard” You will see an invitation to add new visualizations to the dashboard, click “Add” and select our previously created Graph.
You can resize the graph by dragging its bottom right corner. Position and resize it as you need, and click “Save” to save the dashboard. Give it a name like “Database interactions”.
We can theoretically add more items to the dashboard later, based on some other queries and filters, so that we are not looking at a single graph, but a collection of visualizations. These can be Geo maps, heat maps, pie charts, any types of diagrams and graphs that are available in Kibana.
Let’s quickly add another panel to the dashboard which will show the operations performed by the “website-component” services (remember we can have several replicas of these). Create a new search and save it: “kubernetes.container_name: website-component”
Then visualize with a line chart, by adding these filters to the X-Axis “Split series” fields:
- “The request to load necessary news from MySQL succeeded”
- “Received a comment from user”
- “The request to post a comment to MySQL succeeded”
Add this panel to the dashboard (we used “Line Width=4” in the “Metrics & Axes” of visualization to make the line bold for the screenshots readability) so it will look like this:
Website stress test and events correlation in Kibana dashboard
Now we will activate Apache-Benchmark to simulate a DDoS attack. Let’s launch this first on the “load news” endpoint, to try and see if the application we deployed is protected from such abuse method as abnormally high requests rate from the same IP. First, find your external endpoint that can be used to access the service from the internet, run “kubectl get services” and copy the EXTERNAL-IP load balancer address from “website-component” service. We will call this address
You need to install the “ab” tool locally or use an official Docker image to launch the shown commands and generate a high load of requests on your cluster. The command itself just spawns a lot of requests of a particular type, so it does not consume too many resources on the local machine, feel free to run it on your laptop.
Run the following to cause 8000 read attempts from the Load Balancer endpoint of “website-component”:
ab -n 8000 -c 5 -s 2 <website-component-endpoint>/load-news
The command runs 8000 requests, with 5 concurrencies, and a 2-second response timeout.
Have a look at the detected anomaly on the graph below, we have found out that our application infrastructure is not protected from excessive request rate from same IP address!
Note the rate of “The requests to load…” (access of “/load-news”) blue line in the bottom graph, we can easily see that this also caused the db-writer service to send a lot of ‘reads’ to MySQL (“Successfully read from…” cyan line on the top graph) between 22:04 and 22:16.
To see what happened to MySQL during this short ‘DDoS’ simulation, open the “Kubernetes CPU usage” dashboard in Grafana and select the mysql pod on the right:
Scroll to the bottom panel to see the “per Pod” CPU usage graph:
Open two bugs in Jira, one for not protecting the MySQL service by circuit breaking between “website-components” and “db-writer” on excess load, and the second for not limiting requests rate per second from a single IP address, and let’s proceed to test the “write” endpoint named “/submit-comment”.
This endpoint represents an imaginary API that is used to post a comment from a user. Hackers like to abuse any endpoints that have their input texts not filtered properly before a submission to the backend API and the subsequent write to a database. We are about to test whether it’s limited to a particular size of the submitted message, or we can post all contents of the trilogy of “Lord of The Rings” as a single comment and break that website.
Create a small JSON file first, it should contain something like that:
"data":"trying to hack that website!"
Name the file “my-comment.json” then execute the “ab” command in the same folder where you have the file:
ab -p my-comment.json -T 'application/json' -n 1 <website-component-endpoint>/submit-comment
You will immediately see the output of “ab” that tells you a lot of details about the latency and such, but we’re not doing this to measure performance and latency but to abuse the security vulnerabilities of our imaginary website or application, and cause all sorts of trouble. This time we’re using the vulnerability of no character limits of the “/submit-comment” message field, and not rate limit on requests per second, let’s see what damage can we do to the cluster and how easily it is seen in Kibana and Grafana dashboards. Thanks to the visualizations, the DevOps/SRE teams can respond quickly and make sure the website will be patched, and the hotfix deployed in no time thanks to Kubernetes deployments agility.
Modify the “my-comment.json” file to have a massive text in the “data” field, so it will look like that or even worse:
And try to submit that comment once again. You will see the same “ab” output, and it tells us about successful submission to the API, the “Total body sent” is our JSON data weight:
Compare it to the previous “Total body sent: 269” to confirm that we actually sent much more data this time. This means that the developers did not limit the possible comment size, and we can attempt making them run out of disk space by sending plenty requests of such size or larger. Now try to run the submission over time, set to 3600 seconds (1 hour) and analyze the impact on the services:
ab -c 5 -t 3600 -p my-comment.json -T ‘application/json’ -n 10000 <website-component-endpoint>/submit-comment
We added concurrency 5 and time 3600. When the run is completed, examine the dashboards, including the “Kubernetes disk usage” in Grafana. You will see a high spike in I/O rate (the most bottom panel) and consumption of disk space starts to climb up (the top panel).
The impact also can be observed in Kibana, though we will only see the rate of submitted comments per second, but not their POST request body size (we can parse the body size actually if we want, and add that to the dashboard. We leave it up to you as an additional exercise, to visualize the total submitted bytes rate. Parse the messages which contain “Received a comment from user:
In this tutorial, we have seen the importance of a centralized logging and metrics system for any application that is deployed in containers. The amount of time troubleshooting more complex issues in an environment with a larger amount of services can be measured in tens of work hours, and many hours of downtime, loss of users and so on. Deploying the application in Kubernetes using an enterprise solution like Kublr helps you to avoid the hassle of maintenance of centralized ElasticSearch logging clusters and Prometheus nodes, by supplying all these necessary components built in from the start, fine-tuned by experts for Kubernetes use case, and are easily deployed with a click of a button.