Kubernetes for Data Science and Machine Learning

Terry Shea | May 11, 2018

At Kublr we’ve been talking with customers and the community about the workloads they plan to run using containers and Kubernetes. We’re seeing a rapid uptick in interest in using Kubernetes for data science and machine learning applications.

Frameworks from MapReduce to Hadoop to Spark have created parallel processing capabilities that use clusters to speed up processing tasks. These clusters have frequently been managed either with the framework's own cluster manager (e.g., Spark's) or with Apache Hadoop YARN or Mesos Marathon.

Recent developments in Kubernetes for data science and machine learning include the 2.3 release of Apache Spark with “native” Kubernetes support. Mesosphere, the commercial company behind Marathon, announced its own support for Kubernetes at the end of last year. Google has developed, and of course open-sourced, Kubeflow, “A composable, portable, scalable, machine learning stack for Kubernetes”, for its popular TensorFlow machine learning framework. There are tutorials and GitHub repositories for running Hadoop on Kubernetes. There’s a Special Interest Group (SIG) in the Kubernetes community on big data. There’s a hackathon at the upcoming KubeCon conference in Copenhagen for Managing Data in Cloud Native and Data Science. And this list just scratches the surface of the activity.

We think that one reason for this increase in activity around Kubernetes for data science and machine learning is that it enables IT to better support these applications. Having a common orchestration layer for all containerized applications has several benefits:

- Better resource utilization through centralized scheduling of data science and other containerized applications,
- (Potential) portability for workloads,
- A single scheduling solution for multiple environments, on premises or in multiple clouds,
- The ability for IT to create self-service environments for data scientists and other data users.

Kubernetes can also support GPUs to speed parallel processing, and auto-scaling in environments that support it. In fact, Kubernetes provides two types of auto-scaling: pod auto-scaling, where more pods are automatically created in a cluster based on scaling rules, and cluster auto-scaling, where more nodes are added to a cluster based on flexible rules. With the addition of custom metrics, Kubernetes can now use finer-grained scaling rules, such as the number of tasks in a queue, rather than the earlier scaling metrics, which revolved only around CPU and memory.
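
To make the pod auto-scaling rule concrete, here is a minimal Python sketch of the proportional calculation the Kubernetes documentation describes for the Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The function name, the queue-length example, and the default bounds are our own assumptions for illustration, and the sketch leaves out details of the real controller such as its tolerance band and stabilization behavior.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Simplified pod auto-scaling rule (illustrative, not the real controller).

    desired = ceil(current_replicas * (current_metric / target_metric)),
    clamped to the range [min_replicas, max_replicas].
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical custom metric: average number of queued tasks per pod.
    # 4 pods each see 30 queued tasks, and the target is 10 per pod.
    print(desired_replicas(4, 30, 10, max_replicas=20))  # 12
    # The same load with the default ceiling of 10 pods is clamped.
    print(desired_replicas(4, 30, 10))                   # 10
```

With a custom metric such as queue depth, the same rule lets a worker pool grow as a backlog builds and shrink again as it drains, which CPU and memory figures alone may not capture.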

The “native” integration with Spark really is something new. Spark assumes that the Kubernetes cluster already exists and provides a method for creating container images that can be deployed to Kubernetes. The spark-submit tool can then be used to submit an application to Kubernetes. In Kubernetes, one or more containers are placed in a pod. Multiple pods are scheduled per node, and two types of nodes exist: master nodes and worker nodes. With native integration, Spark creates a Spark driver in a Kubernetes pod. The driver creates executors that also run in Kubernetes pods, then connects to them and executes the application code. A complete description of the Spark on Kubernetes capabilities can be found on the Apache Spark project site.
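
As a rough illustration of the submission step, the Python sketch below assembles a spark-submit command line for cluster mode on Kubernetes, using the options documented for Spark 2.3 (`--master k8s://...`, `--deploy-mode cluster`, `spark.executor.instances`, and `spark.kubernetes.container.image`). The helper name `build_spark_submit`, the API server address, the image name, and the jar path are all hypothetical placeholders. The script only prints the command and does not run it.

```python
import shlex
from typing import List


def build_spark_submit(api_server: str,
                       image: str,
                       app_jar: str,
                       main_class: str,
                       executors: int = 2,
                       name: str = "spark-pi") -> List[str]:
    """Assemble (but do not execute) a spark-submit command for Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",   # Kubernetes API server URL
        "--deploy-mode", "cluster",          # the driver runs in a pod
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,                             # local:// means a path inside the image
    ]


if __name__ == "__main__":
    # Every value below is a placeholder to replace with your own.
    cmd = build_spark_submit(
        api_server="https://kubernetes.example.com:6443",
        image="registry.example.com/spark:2.3.0",
        app_jar="local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar",
        main_class="org.apache.spark.examples.SparkPi",
        executors=5,
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

Once a command like this is submitted, the driver pod starts first and then requests the executor pods itself, which is the flow described above.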

This integration takes advantage of Kubernetes’ resource management capabilities while maintaining Spark as the application-level scheduling mechanism. The roadmap for Spark on Kubernetes includes integrating with advanced Kubernetes features like affinity/anti-affinity pod scheduling parameters. To learn more about advanced scheduling in Kubernetes watch this webinar.

So, where’s all this heading? Most likely toward supporting new use cases that leverage Kubernetes’ broad extensibility. IoT use cases are frequently data science-related on the back end. At the edge, they may require pre-processing of data, the ability to run nodes on ARM processors, handling of low bandwidth and limited connectivity, and automated software deployments. From a sensor or device to an edge node, the communication may use MQTT or a similar messaging protocol, but from the edge node back to the data center or cloud, IoT is really a streaming data application.

So, the potential exists for Kubernetes to provide application abstractions that simplify a broad range of use cases while enabling self-healing infrastructure management. At Kublr, we’ve been working on an architecture that supports this broad range of use cases. Feel free to download our demo to set up a Kubernetes cluster and then test Spark on Kubernetes.

You may also be interested in our blog on data science for the enterprise with Shiny (R).