How Kubernetes Could Orchestrate Machine Learning Pipelines | 7wData

How Kubernetes Could Orchestrate Machine Learning Pipelines | 7wData

As a scalable orchestration platform, Kubernetes is proving a good match for Machine Learning deployment — in the cloud or on your own infrastructure.

The cloud is an increasingly attractive location for Machine Learning and data science, because of the economics of scaling out on demand when training a model or serving results from the trained model, so data scientists aren’t wasting time waiting for long training runs to complete. Ovum has been predicting that in 2019 half of all new big data workloads would run in the cloud and in a recent survey, some 45 percent of organizations said they were running at least one big data workload in the cloud.

That can mean cloud machine learning platforms like Azure Machine Learning Studio, Amazon SageMaker and Google Cloud AutoML that offer built-in data preparation tools and algorithms, or cloud versions of existing tools like Databricks (for running Spark workloads on Azure or AWS) or the upcoming Cloudera Machine Learning service, a version of Cloudera data science Workbench that will run on public cloud Kubernetes services.

The reason Hadoop and Spark have been so popular for data science (and following that, for machine learning) is that they use clusters and parallel processing to speed up the parallelizable parts of data processing pipelines. They’re dedicated software stacks where clusters are managed with the project’s own cluster management solution, like Apache Yarn or Mesos Marathon.

But as Kubernetes has become increasingly popular as an orchestrator to create scalable distributed systems, it’s starting to look increasingly attractive as a way to get the flexibility that data scientists want to use their choice of different machine learning libraries and frameworks, the scalability and repeatability that the team running machine learning systems in production need — with the control of resource allocation (including GPUs for fast training and inferencing) that the operations team requires. Those are the problems Kubernetes already solves for other workloads, and now it’s being applied to machine learning and data science.

Instead of separate data science and deployment paths, where data scientists build experiments with one set of tools and infrastructure and development teams recreate the model in a production system with different tools on different infrastructure, teams can have a combined pipeline where data scientists can use Kubeflow (or environments built on Kubeflow like Intel’s open source Nauta) to use Kubernetes to train and scale models built in frameworks like PyTorch and TensorFlow on Kubernetes without having to be infrastructure experts.

Instead of giving everyone their own infrastructure, with expensive GPU systems tucked under the desk, multiple users can share the same infrastructure with Kubernetes namespaces used to logically isolate the cluster resources for each team. “Distributed training can make the cycle of training much shorter,” explained Lachlan Evenson, from Microsoft’s Azure Containers team. “You want a trained model with a certain level of accuracy and data scientists are changing the model until they get the accuracy they want but with large data sets it takes a long time to train and if they don’t have the infrastructure to scale that out, they’re sitting around waiting for that to complete.”

“In recent years, the price of both storage and compute resources has decreased significantly and GPUs have become more available; that combined with Kubernetes makes machine learning at scale not only possible but cost-effective,” said Thaise Skogstad, director of product marketing at Anaconda.

Images Powered by Shutterstock