How to Run a MongoDB Replica Set on Kubernetes with StatefulSet (Formerly PetSet)

Running and managing stateful applications or databases such as MongoDB, Redis, and MySQL with Docker containers is no simple task. Stateful applications must retain their data after a container has been shut down or migrated to a new node (for example, when a failover or scaling operation shuts the container down and re-creates it on a new host).

By default, Docker containers use their root disk as ephemeral storage: a chunk of disk space from the filesystem of the host that runs the container. This disk space can’t be shared with other processes, nor easily migrated to a new host. While you can save the changes made within a container using the “docker commit” command (which creates a new Docker image that includes your modified data), this is not a practical way to persist data.

The “docker volume” feature, on the other hand, lets you run a container with a dedicated volume mounted. Depending on the storage plugin you use, this volume can be another chunk of space on the host machine (this time persistent and independent of the container lifecycle, so it isn’t deleted when the container is removed), network storage, or a shared filesystem mount.

For production-grade management of containerized stateful applications, you can take advantage of tools such as Flocker and Convoy. To avoid configuring these manually for each Docker host in your cluster, you can use Kubernetes “Persistent Volumes,” which abstract the underlying storage layer, be it AWS EBS volumes, GCE persistent disks, Azure disks, Ceph, OpenStack Cinder, or other supported systems.

In this tutorial we will explain how to run a containerized MongoDB 3.2 replica set on Kubernetes 1.5, using the StatefulSet feature (previously named PetSet). StatefulSet assigns persistent DNS names to pods and allows us to re-attach the needed storage volume to whichever machine a pod migrates to, at any time.

Note: To proceed with this tutorial, familiarity with Kubernetes basics and terminology, such as pods, ConfigMaps, and services, is required.

A StatefulSet is used with a dedicated “service” that points to each of its member pods. This service should be “headless,” meaning it doesn’t create a ClusterIP for load balancing but is used for static DNS naming of the pods that will be launched. The service name is referenced in the “spec: serviceName:” field of the StatefulSet configuration file. It causes the creation of enumerated DNS records in the format “name-0,” “name-1,” “name-2,” and so on. Luckily, Kubernetes service discovery allows any pod to access services in the same namespace simply by querying the service name. If a pod is launched and detects its own hostname is “mongodb-4,” it knows where to look for the master, which is “mongodb-0.”
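
For illustration, assuming a headless service named “mongodb” in the “default” namespace and a StatefulSet also named “mongodb” (names are examples, not from the chart we install below), the member pods get DNS records following this pattern:

mongodb-0.mongodb.default.svc.cluster.local
mongodb-1.mongodb.default.svc.cluster.local
mongodb-2.mongodb.default.svc.cluster.local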

In a StatefulSet, pods are launched strictly one after another: only when the previous pod has successfully initialized will the next one start. This way you can confidently plan your deployment around “name-0” being the first launched pod. “Name-0” bootstraps the cluster, replica set, etc., depending on the application you run.

In MongoDB, the master node will initialize a replica set. Pods named “name-1,” “name-2,” and so on will then recognize that a replica set has already been created and will connect to the existing nodes. It’s worth noting that when deploying applications like Consul, MongoDB, and Redis, it can be hard to know which node is the current master. These apps periodically re-elect a master/primary node, not only during failover: for example, even after “rs.add(hostname)” (the MongoDB shell command for adding a new member to a replica set), your next launched pod member can’t be sure that “mongodb-0” is still the primary. By the time the “mongodb-4” pod starts, an internal re-election may have made any previous node the new primary.
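
That is why a new member should ask a reachable peer who the current primary is rather than assuming it is “mongodb-0.” A minimal sketch, assuming a headless service named “mongodb” and the mongo shell available in the pod:

# Ask a known peer which member is currently primary; it may no longer be mongodb-0.
mongo --host mongodb-0.mongodb --quiet --eval 'rs.isMaster().primary'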

All the above should help us understand what’s going on in the following example bash init scripts.

We’re going to use a Helm chart (a Kubernetes package) as an example of deploying a StatefulSet with three MongoDB replica set members. To install the Helm package manager and its server-side component, Tiller, please follow this official install guide.

If you prefer to skip the guide, just run this on the same machine where you have kubectl properly configured (Helm uses the kubectl configuration to connect to the Kubernetes cluster):

 

curl https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get | bash
helm init

 

The first step downloads and installs Helm. The second installs Tiller in your cluster (Helm knows where to install it from the $HOME/.kube/config file and can access the Kubernetes API the same way kubectl does, using this config file).

If both steps complete successfully, you will be presented with the message “Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.” We can then proceed to installing the MongoDB cluster.
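
Before proceeding, you can optionally verify that the Tiller pod is running and that the Helm client can talk to it:

kubectl get pods --namespace kube-system | grep tiller
helm version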

At the time of writing, the MongoDB StatefulSet Helm chart is located in the “incubator” repository, meaning it hasn’t yet been promoted to the “stable” repo. We’ll use this chart as an example to understand how StatefulSet works and how we can modify it to fit our needs, or later run any other type of database using the same techniques.

Check which packages are visible to you with the “helm search” command; notice that you can see only “stable/something” packages. Enable the “incubator” Helm charts repository:

 

helm repo add incubator https://kubernetes-charts-incubator.storage.googleapis.com

 

You will see “incubator has been added to your repositories.” By running “helm search” again, verify that you now see “incubator/something” packages.
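
For example, you can filter the search output to confirm the chart we are about to use is now visible:

helm search | grep mongodb-replicaset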

If you want to change the default values provided with this package, download the “values.yaml” file or simply copy its content and replace any values, such as Storage: “10Gi”, Memory: “512Mi”, or Replicas: 3.

Then, during install, point the command to your modified file with “-f values.yaml”.
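
Here is a minimal sketch of what such an overrides file might look like; the exact key names depend on the chart version you download, so check the chart’s own values.yaml before using it:

# values.yaml (sketch; verify key names against the chart's values.yaml)
Replicas: 3
Storage: "10Gi"
Memory: "512Mi"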

Now you are ready to launch MongoDB replica set with this command:

 

helm install --name mymongo incubator/mongodb-replicaset
or
helm install -f values.yaml --name mymongo incubator/mongodb-replicaset

 

(if you have modified default values).

After a few seconds, check your Kubernetes dashboard; you should see the following resources created:

1. A StatefulSet named “mymongo-mongodb-replicas”

2. “Persistent Volume Claims” and three volumes

3. Three pods named “mymongo-mongodb-replicas-0/1/2”

4. Check your StatefulSet again after a moment; it should be lit green now because all three pods are initialized. You can confirm the same resources from the command line, as shown below.
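
If you prefer the command line over the dashboard, something like this will list the same resources (names may vary with your release name):

kubectl get statefulsets
kubectl get persistentvolumeclaims
kubectl get pods | grep mymongo-mongodb-replicas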


AWS Note: If you are running on AWS, the default Kubernetes 1.5 StorageClass may provision EBS volumes in different availability zones. In that case you will see an error such as “pod (mymongo-mongodb-re-0) failed to fit in any node fit failure summary on nodes : NoVolumeZoneConflict (2), PodToleratesNodeTaints (1)”.

If this happens, delete the EBS volumes in the “wrong” AZs and create a StorageClass constrained to the particular availability zone where your cluster has its worker nodes, using this example:

Create a file named “new-aws-storage-class.yml” with content:

 

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: generic
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: us-west-2a

 

Submit this to Kubernetes with “kubectl create -f new-aws-storage-class.yml,” and you should see a response confirming that the “generic” StorageClass was created.

Now “Persistent Volume Claims” will dynamically create PVs in “us-west-2a” only (replace “us-west-2a” in this file with an AZ that fits your cluster setup).
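
To double-check what was provisioned, list the storage classes and inspect the zone labels on the dynamically created volumes:

# List storage classes and check zone labels on the provisioned PVs
kubectl get storageclass
kubectl get pv --show-labels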


Following the package authors’ advice, we can find out which pod is our primary replica using this bash command:

 

for i in `seq 0 2`; do kubectl exec  mymongo-mongodb-replicas-$i -- sh -c '/usr/bin/mongo --eval="printjson(rs.isMaster())"'; done

 

Then look at which pod shows “ismaster” : true and copy its name (the JSON output shows full service DNS names, so the single pod name is the part to the left of the first dot).

In my case it’s still the “mymongo-mongodb-replicas-0” pod. Let’s write a value into the “master” pod’s MongoDB. To do so, execute this command:

 

kubectl exec mymongo-mongodb-replicas-0 -- /usr/bin/mongo --eval="printjson(db.test.insert({key1: 'value1'}))"

 

If everything is working, you should see “{ "nInserted" : 1 }”.

To read this value from any slave pod, execute:

 

kubectl exec mymongo-mongodb-replicas-2 -- /usr/bin/mongo --eval="rs.slaveOk(); db.test.find().forEach(printjson)"

 

You will see something like this:

 

{ "_id" : ObjectId("587117934a44128a6beac820"), "key1" : "value1" }

 

These basic verification steps prove that replication is working and that a value inserted into the primary node can be fetched from any of the slave replicas.
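
As an additional check, you can print the replica set status from any member and verify that one member reports PRIMARY and the others SECONDARY:

kubectl exec mymongo-mongodb-replicas-0 -- /usr/bin/mongo --eval="printjson(rs.status())"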

You can also log into an interactive MongoDB shell by executing the following:

 

kubectl exec -it mymongo-mongodb-replicas-0 -- /usr/bin/mongo

 

This allows you to perform arbitrary actions on the database, and you can safely exit the shell without worrying that your container will stop on exit (as could happen if you exit a shell opened with “docker attach”).

Let’s take a quick look at the components the Helm chart used to create this MongoDB StatefulSet and the persistent volumes for storing data:

1. Headless Service YAML manifest. A headless service is declared with “clusterIP: None” (and no “NodePort”), which means it doesn’t attempt to load-balance traffic to the underlying pods. Its sole purpose is to trigger the creation of DNS names for the pods.

Please notice the annotation used:

 

service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"

 

It causes endpoints to be created for a pod regardless of its readiness state, which is exactly what we need.

Why? Because we use our own initialization mechanism in MongoDB when a pod starts, and each replica set member must be able to reach the others during initialization, even if they’re not yet “healthy” (ready to serve requests and traffic).
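
A minimal sketch of such a headless service might look like the following (names, labels, and the selector are illustrative, not copied from the chart):

apiVersion: v1
kind: Service
metadata:
  name: mongodb
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  clusterIP: None        # headless: no load balancing, DNS records only
  ports:
    - port: 27017
      name: mongodb
  selector:
    app: mongodb-replicaset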

2. MongoDB daemon configuration declared in a ConfigMap YAML. All extra settings and fine-tuning of mongod behavior go here. This file is rendered into every pod and used as the config file. The one in the Helm chart we used is very minimalistic and has no special performance tuning or anything else you might need in a production deployment; feel free to modify it to fit your needs. The simplest method is to git clone the chart’s repository, modify the needed files and definitions, and run “helm package mongodb-replicaset”, which archives your modified “mongodb-replicaset” folder into a “.tgz” archive to use later with “helm install --name your-release-name mongodb-replicaset-0.1.3.tgz”. If you don’t specify --name, you’ll have to live with a random release name Helm generates for your set, like “curious-penguin” or the like. You can read more about Helm here (highly recommended).
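
For reference, a minimal ConfigMap carrying a mongod.conf could look like this sketch (the actual chart’s config content and key names may differ):

apiVersion: v1
kind: ConfigMap
metadata:
  name: mongodb-replicaset-config
data:
  mongod.conf: |
    storage:
      dbPath: /data/db
    net:
      port: 27017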

3. StatefulSet YAML manifest, which includes the Persistent Volume Claims template. This is the most complex resource declaration; it has init containers defined in this section:

 

pod.alpha.kubernetes.io/init-containers: '[...]'

 

Two containers named “install” and “bootstrap” are started before the main pod container. The first one, “install,” loads a small image which holds special files like “install.sh,” “on-start.sh,” and “peer-finder.” This container does two important things: it mounts two of your newly defined volumes (one is “config,” which is created from the ConfigMap and holds only the config file; the second is a temporary mount named “work-dir”), and it copies the needed init files to “work-dir.” You can rebuild that image and add anything else your stateful application might need. Pay attention: “work-dir” is not yet the persistent volume, it’s just a place for running a few init files in your pod during the next steps.
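
In spirit, the install step boils down to something like this sketch (file paths are illustrative, not copied from the image):

#!/bin/sh
# install.sh (sketch): copy the init helpers from the install image into the
# shared "work-dir" volume so later containers in the pod can run them.
cp /peer-finder /on-start.sh /work-dir/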

The second init container, “bootstrap,” uses the official MongoDB 3.2 image (by the time you read this it might be a newer MongoDB image, but because we do the init steps in separate containers, we don’t need to modify the real mongo image; we add our extra files through the “work-dir” and “config” mounts). It mounts the main persistent volume (defined later on, in the “volumeClaimTemplates” section of the StatefulSet manifest) at the /data/db path and runs “peer-finder,” a simple tool for fetching the other peers’ endpoints from the Kubernetes service API; you can find it here. The “on-start.sh” script contains the logic that detects whether the MongoDB replica set has already been initialized by other peers and joins the current pod to it. If it detects no replica set, it sets the current pod as master and initializes one, so the others can join.
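
Heavily simplified, the gist of that logic looks roughly like this (a sketch of the idea under the assumptions above, not the chart’s actual on-start.sh; the replica set name and hostnames are illustrative):

#!/bin/bash
# on-start.sh (sketch): peer-finder passes the list of peer DNS names on stdin.
while read -r peer; do
  # Ask this peer whether a replica set already exists and who the primary is.
  primary=$(mongo --host "$peer" --quiet --eval 'rs.isMaster().primary' 2>/dev/null)
  if [ -n "$primary" ] && [ "$primary" != "undefined" ]; then
    # A replica set exists: ask the current primary to add this pod as a member.
    mongo --host "$primary" --eval "rs.add('$(hostname -f)')"
    exit 0
  fi
done
# No existing replica set found: this pod bootstraps one and becomes primary.
mongo --eval "rs.initiate({_id: 'rs0', members: [{_id: 0, host: '$(hostname -f)'}]})"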

Before using this in production, the next step is to write more data into the newly created replica set and verify the failover mechanism, in case one of your Kubernetes nodes goes down unexpectedly.

Share your thoughts and questions in the comments section below or visit our site for more information.

Need a user-friendly tool to set up and manage your K8S cluster? Take our demo for a spin.
