Replacing a Data Brick for a GlusterFS Instance Running on a Kubernetes Cluster

While working on a project build environment in AWS using Kubernetes as a container orchestration engine, we recently hit a brick wall which threatened to disrupt our continuous delivery experience.

In addition to these technologies, we also ran Jenkins and Nexus as Kubernetes deployments, both of which require file storage. To meet this need, we set up a GlusterFS cluster, also running under Kubernetes control.

Then, out of the blue, Jenkins started responding slowly. Very soon it became completely unresponsive. What happened and how did we find a fix? There were many factors at play, ranging from the particulars of our GlusterFS setup to the implementation details of Kubernetes volume management. But getting these answers took some detective work.

Here’s a summary of our investigation, findings, and lessons learned.

The Setup

To understand more about the origins of the predicament, let’s take a more detailed look at our setup.

The environment consists of a one-master/three-node Kubernetes (K8S) cluster in AWS and a three-node GlusterFS cluster, based on a StatefulSet, running in K8S.

Each GlusterFS node is backed by its own Amazon Elastic Block Store (EBS) volume. Both the instance's GlusterFS configuration and the data of the bricks it manages are stored on that EBS volume.

Single-instance Jenkins and Nexus servers also run in K8S and use GlusterFS as file storage. Two volumes are set up on the GlusterFS cluster: one for Jenkins and one for Nexus. Each volume is configured as a replicated volume with two replicas and one arbiter, distributed over the three GlusterFS instances.

This setup ensures that the cluster stays functional if a node is lost and that split-brain issues do not occur.

What Happened?

Why did Jenkins stall? Here’s a breakdown of what happened.

Two instances of the GlusterFS cluster are used for the volume’s data replicas while the third instance hosts the volume’s arbiter bricks. As mentioned above, this means that the third instance isn’t intended to store data itself. Rather, its purpose is to ensure quorum when one of the three nodes goes down and to resolve conflicts that would otherwise lead to split-brain issues.

As an arbiter, the third node was configured with a smaller drive. Whereas data nodes had 100GB drives, the arbiter node was set up with a 1GB drive. When a new drive is attached to a node for the first time, it is initialized (formatted) by K8S. When formatting a drive for ext4, Kubernetes uses the default settings of the mke2fs utility. By default, mke2fs calculates the size of the inode table based on the size of the drive.

As a result, the arbiter’s drive could only store up to 64K files and directories (the inode table size), whereas the data nodes each had a much more generous inode table of roughly 6.4M entries.
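The arithmetic behind these two numbers can be sketched as follows. This is a rough estimate only, and it assumes that mke2fs was using its common default ratio of one inode per 16384 bytes of disk space; the article does not state the ratio, and the value actually in effect is set in /etc/mke2fs.conf on the formatting host.

```python
# Rough estimate of how many inodes mke2fs allocates with default settings.
# Assumption: the default inode ratio of 16384 bytes per inode applies.

BYTES_PER_INODE = 16384  # assumed mke2fs default, not taken from the article
GIB = 1024 ** 3


def estimated_inodes(volume_size_bytes: int,
                     bytes_per_inode: int = BYTES_PER_INODE) -> int:
    """Approximate inode count for an ext4 filesystem of the given size."""
    return volume_size_bytes // bytes_per_inode


arbiter_inodes = estimated_inodes(1 * GIB)
data_inodes = estimated_inodes(100 * GIB)

print(f"1 GiB arbiter drive: ~{arbiter_inodes:,} inodes")
print(f"100 GiB data drive:  ~{data_inodes:,} inodes")
```

Under that assumption the 1GB drive gets 65,536 inodes (64K) and each 100GB drive gets 6,553,600, a hundred times more, which matches the figures above.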

However, even though a GlusterFS arbiter node doesn’t store actual data, it still creates one file or directory for each file or directory stored in a replicated volume. Add to this the fact that Jenkins creates many files in its data directory and we were heading for trouble.

After just a few weeks and several hundred builds, the inode table on the arbiter instance was exhausted (even though the volume still had plenty of space available) and GlusterFS stopped responding to file writes.
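A filesystem can run out of inodes while still reporting free space, so both figures are worth checking. The sketch below shows one way to do that from Python with the standard `os.statvfs` call. The function names and the 80% threshold are illustrative choices, not part of our original tooling.

```python
import os


def filesystem_usage(path: str) -> dict:
    """Report inode and block usage (in percent) for the filesystem holding `path`."""
    st = os.statvfs(path)

    def percent_used(total: int, free: int) -> float:
        # Some filesystems report zero totals; treat those as 0% used.
        return 100.0 * (total - free) / total if total else 0.0

    return {
        "inodes_total": st.f_files,
        "inodes_free": st.f_ffree,
        "inode_pct": percent_used(st.f_files, st.f_ffree),
        "space_pct": percent_used(st.f_blocks, st.f_bfree),
    }


def inodes_nearly_exhausted(path: str, threshold_pct: float = 80.0) -> bool:
    """True when inode usage has reached the (illustrative) threshold."""
    return filesystem_usage(path)["inode_pct"] >= threshold_pct


usage = filesystem_usage("/")
print(f"inodes used: {usage['inode_pct']:.1f}%, space used: {usage['space_pct']:.1f}%")
```

Pointed at a brick's mount path instead of `/`, a check like this would show inode usage approaching 100% while space usage stays low, which is the pattern we hit on the arbiter.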

Resolution

In order to resolve the issue, we took the following steps.

A new, larger EBS volume was created and registered as a Kubernetes persistent volume (a corresponding persistent volume claim was also created so that the volume could be mounted to pods).
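For illustration, the two Kubernetes objects involved might look like the output of the sketch below, which builds them as plain dictionaries and prints JSON that could be passed to `kubectl apply -f -`. All names, the volume ID, and the size are hypothetical placeholders. It also assumes the in-tree `awsElasticBlockStore` volume source; newer clusters typically use the EBS CSI driver instead.

```python
import json


def ebs_persistent_volume(name: str, volume_id: str, size: str,
                          fs_type: str = "ext4") -> dict:
    """Build a PersistentVolume manifest backed by an existing EBS volume."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteOnce"],
            # Retain keeps the EBS volume when the PV object is deleted.
            "persistentVolumeReclaimPolicy": "Retain",
            "awsElasticBlockStore": {"volumeID": volume_id, "fsType": fs_type},
        },
    }


def claim_for(pv: dict, claim_name: str, namespace: str = "default") -> dict:
    """Build a PersistentVolumeClaim pinned to the given PersistentVolume."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": claim_name, "namespace": namespace},
        "spec": {
            "accessModes": pv["spec"]["accessModes"],
            "storageClassName": "",  # disable dynamic provisioning
            "volumeName": pv["metadata"]["name"],
            "resources": {
                "requests": {"storage": pv["spec"]["capacity"]["storage"]}
            },
        },
    }


# Hypothetical names, volume ID, and size, for illustration only.
pv = ebs_persistent_volume("gluster-arbiter-new", "vol-0123456789abcdef0", "10Gi")
pvc = claim_for(pv, "gluster-arbiter-new-claim")

print(json.dumps(pv, indent=2))
print(json.dumps(pvc, indent=2))
```

Setting `volumeName` on the claim binds it to one specific volume, which is what makes it possible to control exactly which EBS volume a pod ends up mounting.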

A new pod was started with an Ubuntu Docker image, to which both the old and new EBS volumes were mounted. We used the ‘kubectl exec -it pod -- /bin/bash’ command to run an interactive shell in the pod. All files were then copied from the old volume to the new one.

Extended attributes needed to be set for brick directories to ensure that GlusterFS accepted them as part of the existing volumes, as described in Red Hat’s documentation.
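As a minimal sketch of what setting such an attribute involves, the code below writes the `trusted.glusterfs.volume-id` attribute on a brick directory. The choice of this particular attribute is an assumption based on the general brick-replacement procedure, not a record of the exact commands we ran. The value is assumed to be the raw 16-byte form of the volume's UUID, and the helper names are hypothetical.

```python
import os
import uuid

# Assumed attribute name; the article itself does not list the attributes.
VOLUME_ID_XATTR = "trusted.glusterfs.volume-id"


def volume_id_bytes(volume_uuid: str) -> bytes:
    """Convert a volume UUID string into the raw 16-byte attribute value."""
    return uuid.UUID(volume_uuid).bytes


def mark_brick(brick_dir: str, volume_uuid: str) -> None:
    """Set the volume-id attribute on a brick directory.

    Linux only; writing attributes in the trusted.* namespace requires root.
    """
    os.setxattr(brick_dir, VOLUME_ID_XATTR, volume_id_bytes(volume_uuid))


# Hypothetical UUID, for illustration only.
value = volume_id_bytes("12345678-1234-5678-1234-567812345678")
print(value.hex())
```

The existing value can be read from the old brick directory with `os.getxattr` (again as root) and compared with the new one before the pod is restarted.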

Next, we needed to swap the new EBS volume into the GlusterFS cluster, as follows:

  • First, we deleted the interactive shell pod.
  • Then, we deleted both the old and new arbiter Persistent Volume (PV) and Persistent Volume Claim (PVC) Kubernetes objects. This doesn’t remove the EBS volumes; it simply unregisters them from the Kubernetes cluster. Similarly, it doesn’t disconnect the old EBS volume from the GlusterFS arbiter pod (yet), as the pod is still running.
  • Next, we deleted the GlusterFS arbiter pod. At this point, Kubernetes unmounts the old EBS volume and detaches it from the EC2 instance where the pod was running. The GlusterFS StatefulSet attempts to restart the pod, of course, but keeps failing since there is no PVC to connect the pod to.
  • Finally, we recreated the PV and PVC Kubernetes objects using the same names as the old arbiter’s PV and PVC, but pointing at the new EBS volume.

After this last step, the GlusterFS StatefulSet finally succeeded in recreating the GlusterFS pod with the replacement EBS volume connected, and the GlusterFS cluster was once more fully functional.

Lessons Learned

After finding ourselves in this predicament (and a way out of it), here are five lessons we learned:

  1. Always monitor your inode usage.
  2. Symmetrical clusters are preferable to asymmetrical ones. If we had used EBS volumes of the same size on all GlusterFS nodes, we would have avoided this issue.
  3. Kubernetes is great, particularly for managing and operating applications that care about persistent data. With it, this process is a breeze.
  4. Despite this, don’t forget that Kubernetes provides an abstraction layer. As such, it is prone to abstraction leaks and relies on the defaults of the layers beneath it. Even though those defaults are often reasonable, they may not be ideal for specific use cases.
  5. GlusterFS is a useful tool, but like any great tool it has its own set of implementation requirements that may affect its use in production, such as disk requirements for arbiter bricks in replicated volumes.
